The Reboot: A Deep Learning Sitcom Script Generator
Project Check-In #3
Final Report
Poster
Who
- Grace Lee — glee73
- Jian Cong Loh — jloh4
- Riya Dulepet — rdulepet
Introduction
The end of a hit sitcom like Friends or The Office is invariably followed by calls from fans for a reboot. As fans of the genre, our team decided to create a deep learning model that generates new lines for a particular sitcom after being trained on its existing scripts. Our project involves language modeling, which is a self-supervised machine learning task. Using deep learning to generate a sitcom script is a complex and interesting problem: we are not only interested in generating natural language, but also want the text to semantically resemble lines from a particular sitcom.
Related Work
GPT-2, a state-of-the-art language model, has an architecture based on a decoder-only transformer. Unlike the original transformer architecture, which has a series of encoder and decoder blocks, GPT-2 consists only of decoder blocks, each containing a masked self-attention layer and a feed-forward neural network. In masked self-attention, when we compute the attention scores for a given token, every token that comes after it is masked, so attention is restricted to the tokens that have already appeared in the sequence.
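To make the masking concrete, here is a minimal single-head sketch in plain Python. It is an illustration only, not GPT-2's actual implementation: the function name `causal_self_attention` is hypothetical, and the learned query/key/value projections and multiple heads are left out.

```python
import math

def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of equal-length vectors, one per token.
    Illustrative only: no learned projections, no multiple heads.
    """
    dim = len(queries[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Position i may only look at positions 0..i; later tokens are masked.
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
                  for j in range(i + 1)]
        # Numerically stable softmax over the visible positions.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Masked (future) positions receive exactly zero attention weight.
        weights += [0.0] * (len(queries) - i - 1)
        all_weights.append(weights)
        # Output is the attention-weighted sum of the value vectors.
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs, all_weights
```

The first token can only attend to itself, so its attention row is `[1.0, 0.0, ...]` regardless of the inputs, while the last token attends over the whole sequence.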
References:
- https://jalammar.github.io/illustrated-gpt2/
- https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
- https://jaketae.github.io/study/gpt/
- https://github.com/karpathy/minGPT
- https://huggingface.co/docs/tokenizers/quicktour
Data
Our main data comes from the Friends TV Show Script dataset on Kaggle, which contains the script for all episodes of the Friends sitcom. In preprocessing, we filtered out lines that contain episode information and stage directions to leave only lines with character dialogues. We also parsed out the character who spoke the line as a separate list.
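The filtering step could look roughly like the sketch below. The file layout is an assumption, not something confirmed by the dataset description: dialogue lines are taken to have the form `Speaker: text`, scene headings and stage directions are taken to be wrapped in `[...]` or `(...)`, and the `CREDITS` set and the function name `extract_dialogue` are hypothetical.

```python
import re

# Assumed layout (may not match the real Kaggle file exactly):
#   dialogue:          "Speaker: text"
#   stage directions:  "(...)" or "[...]", alone on a line or inline
DIALOGUE = re.compile(r"^([A-Za-z .'&-]+):\s*(.+)$")
DIRECTION = re.compile(r"\([^)]*\)|\[[^\]]*\]")
# Hypothetical episode-information prefixes that look like speakers but are not.
CREDITS = {"written by", "directed by", "transcribed by"}

def extract_dialogue(raw_lines):
    """Return parallel lists: who spoke each line, and what was said."""
    speakers, lines = [], []
    for raw in raw_lines:
        # Drop stage directions, then see whether any dialogue remains.
        text = DIRECTION.sub(" ", raw).strip()
        match = DIALOGUE.match(text)
        if not match:
            continue
        speaker = match.group(1).strip()
        spoken = " ".join(match.group(2).split())  # collapse extra whitespace
        if speaker.lower() in CREDITS:
            continue
        speakers.append(speaker)
        lines.append(spoken)
    return speakers, lines
```

Keeping the speakers in a separate, parallel list makes it easy to later prepend a character token to each line.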
To pretrain our model, we used the WikiText-2 language modeling dataset, which contains Good and Featured articles from Wikipedia. In preprocessing, we removed header lines to leave only the article text, then split each remaining line, which holds a single paragraph of an article, into individual sentences.
To tokenize and prepare our data for the model, we used Hugging Face’s Tokenizers library to build and train a byte-pair encoding (BPE) tokenizer with a vocabulary size of 10,000. BPE is a subword tokenization algorithm that starts with all the characters in the corpus as tokens and repeatedly merges the most frequent pair of adjacent tokens until the vocabulary reaches the specified size. Used in advanced language models like GPT, BPE lets us represent rare words while keeping the vocabulary size tractable by breaking complex words down into subword tokens.
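The merge loop described above can be sketched in a few lines of plain Python. This is a toy illustration of the algorithm, not the Hugging Face implementation we used; the function name `train_bpe` and the tie-breaking rule are assumptions made for the example.

```python
from collections import Counter

def train_bpe(words, vocab_size):
    """Learn BPE merges from a list of words until the vocabulary reaches vocab_size."""
    # Each word starts as a tuple of single-character tokens.
    corpus = Counter(tuple(word) for word in words)
    vocab = {ch for word in corpus for ch in word}
    merges = []
    while len(vocab) < vocab_size:
        # Count every adjacent token pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single token
        # Most frequent pair; ties broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merged = best[0] + best[1]
        merges.append(best)
        vocab.add(merged)
        # Rewrite the corpus with the chosen pair fused into one token.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges, vocab
```

On a toy corpus of `hug` (5 times), `pug` (3 times), and `hugs` (2 times), the first merge fuses `u` and `g` because that pair is the most frequent, and the second fuses `h` with the new `ug` token.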
We used the trained tokenizer to encode both datasets and transform them into appropriate inputs for our model by adding special tokens:
- [BOS]: beginning of sequence
- [EOS]: end of sequence
- [DELIM]: delimiter for finetuning
- [PAD]: padding
Methodology
Our model will replicate the decoder-only transformer of GPT-2. Specifically, our inputs will first pass through a token embedding and positional encoding layer, followed by a number of decoder blocks that each comprise masked multi-headed self-attention and a feed-forward neural network.
Figure 1: Decoder-only transformer used in GPT
(Source: Radford et al., Improving Language Understanding by Generative Pre-Training)
Given that GPT-2 has up to 1.5 billion parameters and is pre-trained on an extremely large dataset, it will not be possible for us to create a general-purpose model that is as powerful. However, our task is more specific than general-purpose language modeling, so we hope that the same architecture at a smaller scale will still be powerful enough to learn the semantics of sitcom lines and perform well on our task.
Instead of pre-training the model on a large corpus of text and fine-tuning it for our specific task, we will directly train the model to generate sitcom lines using our sitcom lines dataset. A single training input into our model will be a tokenized sitcom line of a fixed size, preceded by a token representing the character that spoke the line and a delimiter, e.g.: [<start>, Kevin, <delim>, why, waste, time, say, lot, word, when, few, word, do, trick, <end>, <pad>]. In training, the model will start predicting once the first four tokens are in its context, i.e. it will first try to predict waste from [<start>, Kevin, <delim>, why], and then proceed to predict each subsequent token in order. Once the model has been trained, we can supply it with a character and the first word of a line and let it generate a new line for that character.
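The input format and the prediction targets can be sketched as follows. This is an illustration under stated assumptions rather than our training code: the helper names `build_example` and `next_token_pairs` are hypothetical, tokens are kept as strings instead of vocabulary ids, and skipping padding positions is one plausible way to handle them.

```python
def build_example(speaker, words, max_len,
                  start="<start>", delim="<delim>", end="<end>", pad="<pad>"):
    """Format one line as [<start>, speaker, <delim>, words..., <end>, <pad>...]."""
    tokens = [start, speaker, delim] + list(words) + [end]
    tokens = tokens[:max_len]                   # truncate overly long lines
    tokens += [pad] * (max_len - len(tokens))   # pad short lines to a fixed size
    return tokens

def next_token_pairs(tokens, prompt_len=4, pad="<pad>"):
    """Build (context, target) pairs, starting once prompt_len tokens are in context."""
    pairs = []
    for t in range(prompt_len, len(tokens)):
        if tokens[t] == pad:
            break  # assumption: padding positions contribute no training signal
        pairs.append((tokens[:t], tokens[t]))
    return pairs
```

For the example line above with `max_len=16`, the first pair asks the model to predict `waste` from `[<start>, Kevin, <delim>, why]`, and the last pair asks it to predict `<end>`.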
To improve the performance of our generator, we are also considering the use of a generative adversarial network (GAN). We can train another neural network (the discriminator) to distinguish between the original and model-generated sitcom lines and train our generative model to produce sitcom lines that are harder to distinguish from the original.
If we run into issues with implementing the decoder-only transformer, we could also experiment with a conditional variational autoencoder whose encoder and decoder use transformer blocks similar to those we used for machine translation.
Metrics
As with other language models, we can evaluate our model by looking at its perplexity on the test set. We cannot determine an appropriate threshold for the perplexity score that would constitute success ahead of time as the perplexity score is dependent on the corpus and our choice of vocabulary. Nonetheless, we can use the model’s perplexity score as an evaluation metric to guide our efforts to improve our model.
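Perplexity is the exponential of the average negative log-likelihood that the model assigns to the true next tokens. A minimal sketch, assuming those per-token probabilities have already been collected from the model (the function name `perplexity` is hypothetical):

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities the model gave to each true next token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Mean negative log-likelihood per token, then exponentiate.
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)
```

A model that guesses uniformly over a vocabulary of V tokens has a perplexity of V, which illustrates why the score depends on the choice of vocabulary and cannot be compared across different tokenizations.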
While we do not have an objective quantitative metric to measure success, we have a series of informal checkpoints for our model. Our base goal is to generate a single line that is syntactically correct and semantically makes sense in the context of the sitcom and the given character. Our target goal is to generate multiple lines for the sitcom that are semantically correct in relation to one another. Lastly, the implementation of a GAN to improve the performance of our generative model will be a stretch goal.
Ethics
Deep learning is a good approach to this problem because we want to generate natural language that is semantically similar to our training corpus. Deep learning is typically most appropriate for complex problems that require discovering hidden patterns in the data and modeling complex relationships between variables, both of which are necessary for original text generation.
We are using lines from the show Friends that have been collected, labeled, and published on Kaggle by a user. We are not particularly concerned about the reliability of the data, but we are worried that the model may generate harmful lines that reflect harmful biases, given that Friends contains a number of inappropriate and offensive jokes. A possible way to reduce this risk is to filter these jokes out of our dataset so that our model is less likely to learn and reproduce the biases.
Division of Labor
Riya will focus primarily on data pre-processing, while Grace and Jian Cong will collaborate on developing the model architecture.