The Office Character Dialogue Generation Using Images

Team: Assistants (to the) Regional Programmer

Contributors: Noah Foster (nfoster3), Chloe Griffin (dcgriffi), Abubakarr Jaye (ajaye), Christine Tseng (ctseng6)

UPDATE: Final Deliverables

Please view our final report linked here

Please view our final poster linked here

Please view our final presentation linked here

Please view our GitHub Repository here

UPDATE: Check-in 2: Reflection Link

Please view our reflection linked here.

Introduction: What problem are you trying to solve and why?

In this project, we hope to build a quote generator for the television show The Office (2009). In particular, the quote generator will take in an image, or still frame, from the show, identify the present characters, and output lines based on what the characters would say. Thus, our project is like image captioning, but with a twist – we now grab context from the entire TV show’s scripts to inform the “captioning” of a still frame through generating potential character quotes.

There have been many attempts to create models that generate captions for images; however, there hasn't been any attempt to generate dialogue based on an image. This is interesting work since it can help screen writers and manga writers generate content or ideas when creating shows and movies. The problem is thus a classification and structured prediction task. The model needs to classify the characters in the image and also generate text for each character.

Related Work:

We found many attempts to generate captions for images using deep learning: for example, the author of this blog uses a CNN-RNN model for generating captions given an image. In the blog, the author includes the code for tokenizing the vocabulary available for use in captioning, as well as implementing a Keras Model of their defined CNN-LSTM model. In particular, we hope to pull from the blog’s general architecture of feature extraction (CNNs), sequence processing (LSTM), and decoding (merging the models) in order to similarly build our architecture of 1) understanding a still frame, 2) utilizing a language model, and 3) outputting an actual quote. Additionally, this survey and this paper show many different applications of deep learning for image captioning. For the image classification portion, we found an example of using ResNet to train on The Office datasets. In addition, CLIP is a seminal paper that has proposed an innovative solution for aligning images and texts. We also know some old papers such as Devise also proposed a unique way of incorporating textual information in visual models.

Data: What data are you using (if any)?

We are using several datasets gathered from Kaggle for our model. The first contains character images from The Office (The Office Characters). If we need to include additional faces for our model beyond the characters included above, we found a collection of celebrity faces (CelebFaces Attributes (CelebA) Dataset). The next includes quotes from the show (The Office Lines), which are sorted by character name. We also have the entire transcript of the show (Complete Dialogue/Transcript) with speaker names for additional data to train on if needed. Lastly, we found a sentiment analysis (The Office Sentiment Analysis) on episodes and lines. This contains an analysis of the show transcript, which could provide additional information for training our language model.

For our image data, we will do the standard preprocessing as we did with the CIFAR datasets (image augmentation, resizing for neural network, etc). The images are already labeled and are pretty consistent in style. There are a few images which contain drawings or alternative depictions of the characters. For example, there is a pop figure of The Office character Angela. We may choose to remove some of these depending on training performance. The character images dataset is only 1523 images with size 19.33 MB. The celebrity dataset (if needed) is significantly larger at 1.45 GB. For the language datasets, The Office Lines will not require much preprocessing (as indicated by a usability metric of 10.00 on Kaggle!), but it does contain additional information about sounds in the scenes within square brackets, ex: [Telephone rings]. These can be removed in preprocessing if necessary. The Office Lines data set is 4.83 MB. The Complete Dialogue/Transcript is similar in size (4.81 MB). It has a lower usability score, but does not contain the brackets mentioned above. Both of the datasets are useful and we can decide which one we want to focus on as we begin the training process.

Methodology: What is the architecture of your model?

We will have two network components in our model. The first network will likely be a CNN model used to classify the character in the image. Alternatively, we can fine-tune a Res-Net model for image classification as we’ve seen in similar work. We hope to incorporate additional information about the image using CLIP. We could consider the general environment of the image (i.e. outside, inside, time of day, etc.) or other objects in the image (office supplies, food, etc.). If we have time, we may also include a facial expression recognition component which can better inform our quote generator. For the second part of our model, we will fine-tune a Large Language Model (likely GPT-J) on the transcript of The Office. We choose to use a pre-trained LM since models that we can train from start to finish on our own computers will not be able to output any dialogue that sounds remotely like coherent english, much less a specific character. In its totality, the combination of the two discussed networks is an approach we have not seen in previous image captioning or language generation models. This will allow the language generator to provide quotes which match the personality and characteristics of the character seen in the image. While we will start with a small model which can detect the character and output associated dialogue, we will fine tune our model to make more informed decisions based on the image context. We may also include multiple characters for more extensive dialogue. Ultimately, we will combine the output of our two networks into a transformer in order to generate quotes.

Metrics: What constitutes “success?”

Our experiments will consist of feeding still frames from The Office into our model and seeing what quote for which character is generated. We will test our still frames with only one character (normal classification) as well as still frames with multiple characters present (multi-class labeling classification). Since we are working with multiple component networks, we need evaluation metrics in multiple places of our model. In our first network of image classification, we expect evaluation metrics similar to those from image classification homeworks, such as loss measured by categorical cross-entropy and accuracy measured by confusion matrix sensitivity/specificity metrics. In our second network of language modeling, we expect evaluation metrics similar to those from LM homeworks, such as loss measured by sparse categorical cross-entropy and accuracy measured by perplexity.

Our goals are as follows:

Base Goal: to generate text resembling a script.
Target Goal: to generate text that could plausibly have been spoken in the still frame.
Stretch Goal: to generate text that sounds like a character from The Office in the still frame.

Ethics:

1. Why is Deep Learning a good approach to this problem?

Deep learning is necessary for this problem in several technical ways. First, it allows us to use image classification techniques to recognize character faces. Additional layers may allow us to detect characteristics of the image for better dialogue output in the language model. Furthermore, it allows us to produce new written text in the same style as previously written text. In our case, we are using multiple layers to train the model to generate similar dialogue as included in the tv show. Without deep learning, this task would only be able to be completed by having humans generate new text and label the images themselves. Using deep learning, we allow a computer to complete this job for us. Given the technical merits of deep learning in our problem, we find it worthwhile to explore deep learning as a solution and remain cognizant of monitoring our use case to indeed remain a fun and helpful tool.

2. Who are the major “stakeholders” in this problem, and what are the consequences of mistakes made by your algorithm?

The major stakeholders are the creators of the show and the fictional characters in The Office that we are portraying. Consequences could be that if we use a base model that has been trained on data from the internet, biases of the data might be learned by the model causing it to generate quotes that are thus biased against particular groups. This could reflect poorly on the characters and the authors. Nevertheless, we can mitigate this by including a disclaimer which explains that the outputs of this model do not reflect the thoughts or opinions of the show creators. Furthermore, we can include an explanation that our results are a reflection of the training data and models used.

Division of labor:

With our combination of two networks, we expect to split two group members to each network. We currently have Chloe and Christine working on the image classification/processing, and Abu and Noah working on the Language Modeling. After finalizing the component networks, we will combine together to work on merging the work into a transformer to generate quotes.