A Debiased Deep Learning Model for Generating Humorous Memes

Jieyun Li (jli197), Yitong Wang (wyitong), Yixun Kang (kyixun), Zhihao Li (lzhihao)

GitHub: https://github.com/LeoLi1223/2470-final-project/tree/main

INTRODUCTION

The rapid rise of online social media has propelled memes into a dominant form of cultural expression. Memes typically combine images, text, or video, and are quickly passed around and sometimes altered by internet users. They frequently draw on popular culture, such as movies, TV shows, and celebrities, as well as on personal experiences, which makes them effective vehicles for humor and cultural commentary.

Given how common memes are, there is significant interest in making them more appealing through better captioning. Meme creation has its own challenges, however, notably biases that can skew a meme's content and impact. This motivates the development of debiased neural networks. Our model aims to generate humorous memes while mitigating bias, so that memes remain a source of enjoyment and reflection for a diverse global audience.

This project is a structured prediction problem. It requires a model that can generate captions from visual content, recognize humorous elements, and attend to the relevant parts of an image. The model must also identify and mitigate biases in the meme generation process. The challenge extends beyond predicting text: it involves understanding how text interacts with visual elements to produce an outcome that is humorous and appropriate for a broad audience. Our approach grounds the humor in the context of the imagery, so that the generated captions are coherent, contextually relevant, and consistent with our debiasing objectives.

METHODOLOGY

Data

This project uses a dataset provided by a Ph.D. student from the Skolkovo Institute of Science and Technology (Skoltech) in 2020 [1]. The dataset comprises 300 meme templates, each accompanied by 3,000 captions. Because 300 templates is a relatively small sample, we use an 80:20 training/testing split.
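A minimal sketch of an 80:20 split is given below. It assumes the split is made at the template level, which the text above does not state explicitly, and the function name, seed, and integer template ids are illustrative only.

```python
import random

def split_templates(template_ids, train_fraction=0.8, seed=0):
    """Shuffle template ids and split them into train and test lists."""
    ids = list(template_ids)
    # A fixed seed keeps the split reproducible across runs.
    random.Random(seed).shuffle(ids)
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]

# Hypothetical ids 0..299 standing in for the 300 meme templates.
train_ids, test_ids = split_templates(range(300))
print(len(train_ids), len(test_ids))
```

Splitting by template rather than by caption keeps every caption of a held-out template out of the training set.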

Table 1. Samples from the dataset.

Preprocessing

Initial Caption Preprocessing

We began preprocessing by constructing a word dictionary from all captions. Words that occurred fewer than three times were replaced with an unknown token to reduce the noise caused by rare words. We then removed captions containing more than two unknown tokens, so that the data fed into our model retained enough meaningful content. After filtering, we randomly selected 100 captions for each meme template. This kept the dataset at a manageable size while preserving enough caption diversity for training.
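The preprocessing steps above can be sketched as follows. This is an illustration rather than the project's actual code: it assumes whitespace tokenization, reads the caption filter as a cap on the number of unknown tokens per caption, and uses hypothetical names (`UNK`, `build_vocab`, `replace_rare_words`, `preprocess`).

```python
import random
from collections import Counter

UNK = "<unk>"  # hypothetical spelling of the unknown token

def build_vocab(captions, min_count=3):
    """Return the set of words that occur at least min_count times."""
    counts = Counter(word for caption in captions for word in caption.split())
    return {word for word, n in counts.items() if n >= min_count}

def replace_rare_words(caption, vocab):
    """Tokenize on whitespace and replace out-of-vocabulary words with UNK."""
    return [word if word in vocab else UNK for word in caption.split()]

def preprocess(captions_by_template, min_count=3, max_unk=2,
               per_template=100, seed=0):
    """Replace rare words, drop captions with too many UNKs, then sample."""
    all_captions = [c for caps in captions_by_template.values() for c in caps]
    vocab = build_vocab(all_captions, min_count)
    rng = random.Random(seed)
    result = {}
    for template, caps in captions_by_template.items():
        tokenized = [replace_rare_words(c, vocab) for c in caps]
        # Assumed filter: keep captions with at most max_unk unknown tokens.
        kept = [t for t in tokenized if t.count(UNK) <= max_unk]
        # Sample without replacement; take everything if fewer remain.
        result[template] = rng.sample(kept, min(per_template, len(kept)))
    return result
```

The vocabulary is built over all templates before any filtering, so a word's frequency reflects the whole corpus rather than a single template.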

Image Feature Extraction

To complement the textual data, we extracted image features with InceptionV3, a pre-trained deep convolutional neural network widely used for image analysis. Each image was resized to 299 x 299 pixels and preprocessed to match the model's input requirements. We then applied a Global Average Pooling (GAP) layer to summarize the network's output into a single 2048-dimensional vector per image.
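To make the pooling step concrete, the sketch below implements global average pooling in plain Python on a toy feature map. For 299 x 299 inputs, InceptionV3's last convolutional block produces an 8 x 8 x 2048 feature map, and averaging over the 8 x 8 grid yields the 2048-dimensional vector. The function name and the toy data are illustrative; in practice a deep learning framework's pooling layer would do this.

```python
def global_average_pool(feature_map):
    """Average a height x width x channels feature map over its spatial grid.

    feature_map is a nested list indexed as [row][col][channel].
    Returns a list with one mean value per channel.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    totals = [0.0] * channels
    for row in feature_map:
        for cell in row:
            for k, value in enumerate(cell):
                totals[k] += value
    return [total / (height * width) for total in totals]

# Toy 2 x 2 grid with 3 channels, standing in for the 8 x 8 x 2048 map.
toy = [
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    [[5.0, 0.0, 2.0], [7.0, 4.0, 2.0]],
]
pooled = global_average_pool(toy)
print(pooled)  # one value per channel
```

Because the spatial grid is averaged away, the output length depends only on the number of channels, not on the grid size.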

Final Data Compilation

The preprocessed captions and the extracted image features were compiled into a single data file. This file integrates text and image data, facilitating the subsequent stages of model training and testing. The structured pairing of image features with corresponding captions is crucial for training our models, as it enables the models to learn the association between visual content and textual descriptions effectively.

Model Architecture

Our model follows an encoder-decoder structure that integrates visual and textual elements to generate captions. We evaluated two candidate models: an LSTM-based recurrent encoder-decoder and a Transformer that includes a multi-headed attention mechanism and a dense layer. Both models were designed to combine text and imagery, but the Transformer outperformed the RNN, producing more coherent and contextually relevant captions. We therefore chose the Transformer as our final model.

Table 2. Comparing the outputs of the two candidate models.

In detail, the Transformer's workflow starts with an image encoder that converts the visual input into a fixed-length feature representation. In parallel, captions pass through word embedding and positional encoding, which prepares them for the Transformer decoding stages. The decoder consists of several layers of multi-headed attention and feed-forward networks, which synthesize the textual output from the combined image and caption information. We trained with the Adam optimizer, a learning rate of 0.001, and a batch size of 100.
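The positional encoding step can be illustrated with a short sketch. The text above does not say which positional encoding was used, so this assumes the fixed sinusoidal scheme from the original Transformer formulation; the function names and dimensions are illustrative only.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(max_len):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k + 1) shares one frequency.
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings, table):
    """Add the position encoding to each token embedding element-wise."""
    return [
        [e + p for e, p in zip(embedding, table[pos])]
        for pos, embedding in enumerate(embeddings)
    ]

table = positional_encoding(max_len=4, d_model=6)
```

Adding the table to the word embeddings gives the decoder information about token order, which attention alone does not provide.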

Figure 1. Transformer's architecture.

To keep the captions appropriate and inclusive, the model incorporates a RoBERTa-based layer fine-tuned to identify offensive or biased language [2][3]. The unfiltered captions generated by the Transformer are scored by the RoBERTa model, which has been extensively trained on social media data [2]. The model outputs an offensive score that indicates the level of potential bias in a caption and flags those that need further refinement. Captions whose offensive score exceeds a predetermined threshold of 0.3 are regenerated with added randomness. The final output consists of filtered captions with lower offensive scores, which aim to preserve the humorous intent of the memes while remaining suitable for a broad audience.
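The filtering step can be sketched as a bounded regeneration loop. This is a sketch under stated assumptions, not the project's implementation: `generate` and `score` are hypothetical callables standing in for the Transformer decoder and the RoBERTa classifier, "adding randomness" is modeled as raising a sampling temperature, and the attempt cap and step size are illustrative values.

```python
def filter_caption(generate, score, threshold=0.3, max_attempts=5,
                   base_temperature=1.0, temperature_step=0.2):
    """Regenerate a caption until its offensive score is at most threshold.

    generate(temperature) -> str and score(caption) -> float are
    hypothetical stand-ins for the caption model and the classifier.
    Returns (caption, offensive_score, attempts_used).
    """
    temperature = base_temperature
    caption = generate(temperature)
    current = score(caption)
    attempts = 1
    while current > threshold and attempts < max_attempts:
        # Assumed way of "adding randomness": sample at a higher temperature.
        temperature += temperature_step
        caption = generate(temperature)
        current = score(caption)
        attempts += 1
    # If the cap is reached, the last caption is returned even when its
    # score is still above the threshold.
    return caption, current, attempts
```

The attempt cap guarantees termination, at the cost that a returned caption is not guaranteed to fall below the threshold. In a real pipeline, `score` could wrap the probability that the classifier in [3] assigns to its offensive class.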

Figure 2. Incorporating the RoBERTa-based layer with the transformer.

RESULTS

Quantitative

We relied primarily on one metric, average perplexity, to tune the number of captions per image and the number of epochs. Average perplexity measures the model's uncertainty in predicting the next word in a caption; lower values are better. Across training, the model trained with 100 captions per image consistently had the lowest average perplexity at each epoch, which indicates a more stable and effective learning process than the runs with 50, 150, and 300 captions. Given this consistent advantage, we used 100 captions per image for training.

Perplexity fell from 400 at epoch 1 to 200 at epoch 6 and then increased slightly. However, judging from the generated captions, the text was most coherent at epoch 10, and perplexity had not increased much by then, so we trained for 10 epochs.
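For reference, perplexity is the exponential of the mean negative log-likelihood of the true tokens. The sketch below computes it for one caption and averages over captions. Averaging per-caption perplexities is one possible convention and is an assumption here, as are the function names.

```python
import math

def perplexity(token_probs):
    """Perplexity of one caption from the model's probability of each true token."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

def average_perplexity(captions_token_probs):
    """Mean of per-caption perplexities (an assumed averaging convention)."""
    values = [perplexity(probs) for probs in captions_token_probs]
    return sum(values) / len(values)
```

A model that assigns probability 0.25 to every true token has perplexity 4, which can be read as being as uncertain as a uniform choice among four words.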

Figure 3. Average perplexities for different numbers of input captions across 10 epochs.

Figure 4. Average offensive scores for filtered and unfiltered captions.

For content appropriateness, the average offensive score from the RoBERTa-based evaluation indicates how effective our debiasing filter is. The graph comparing unfiltered and filtered captions over five selections of 20 images shows a consistent reduction in offensive scores for the filtered captions. The average offensive score drops from 0.25 to 0.2, a 20% relative reduction. This suggests that the model can generate captions that are contextually appropriate and less likely to perpetuate biases or offensive content, in line with our goal of producing inclusive and respectful meme captions.

Qualitative

Table 3. Generated outputs.

Table 4. Unfiltered vs. filtered captions.

The filtered captions are more neutral and gentle in tone, and their offensive scores decrease sharply, which supports the effectiveness of the bias filtering.

CHALLENGES

Dataset Limitations

The primary challenge stems from the limited size and quality of our dataset. With only 300 meme templates, our sample size is relatively small, which could hinder the model's ability to learn effectively. The dataset also contains repetitive and nonsensical captions, which can confuse the model during training and lead to less coherent outputs. Augmenting the dataset or manually refining the captions would improve quality, but both are labor-intensive and time-consuming and require careful attention to preserve the integrity of the dataset.

Debiasing Method Limitations

Another significant challenge arises from our choice of debiasing methodology. Initially, we planned to employ state-of-the-art debiasing techniques to improve the robustness and fairness of our model. We considered several approaches, including DEAR (Debiasing with Additive Residuals), adversarial learning, and a discriminator-generator pair of the kind used in Generative Adversarial Networks (GANs) [4]. Each of these methods offers potential advantages in handling biases, especially those related to protected attributes.

However, implementing these advanced methods proved too complex within the constraints of our project timeline. The literature on DEAR, while promising, does not provide explicit implementation guidance, which left significant uncertainty about how to apply and adjust the additive residuals. Defining an appropriate adversarial loss function for the adversarial network approach was also difficult. We had doubts about the efficacy of GANs for debiasing in our context as well, and considered an initial pre-training phase on the Protected Attribute Tag Association (PATA) dataset to evaluate their potential [4].

Further complicating our approach, the suitability of the PATA dataset for our specific needs came into question. The PATA dataset is typically more effective for tasks that require accurate image description through captions. However, meme captions, which often do not directly describe the images but rather add a humorous or satirical twist, may not benefit as much from this dataset, raising concerns about its effectiveness in our debiasing efforts.

Given these complexities, we opted to use RoBERTa to detect and filter offensive content. While RoBERTa is a robust model commonly used in natural language processing, it is not optimally suited for debiasing content. This approach has limitations in accurately identifying and mitigating all forms of biased content, particularly those stemming from protected attributes. This constraint means that our final model, while effective in reducing overtly offensive content, may not fully address more subtle biases inherent in the data.

REFLECTION

Looking back, we consider the project successful overall, though there is clear room for improvement. Our base goal was to develop a model capable of generating coherent meme captions, and our target goal was to use RoBERTa to filter out biased captions with an offensive score > 0.5. The project has largely met both goals: the Transformer model integrated with RoBERTa for debiasing performed as we hoped. Our stretch goal was to enhance the RoBERTa integration by adding a filter layer that replaces offensive or biased words, targeting a 5% reduction in offensive scores so that captions are fair as well as coherent. Instead of replacing biased words directly, we adapted our strategy so that the model regenerates a caption whenever the detected offensive score exceeds a threshold of 0.3. This mitigates bias by preventing biased captions from being used, rather than by altering individual words within captions. In general, the model worked reasonably well, especially in generating coherent captions and reducing explicit biases. It was less effective at addressing more subtle biases related to protected attributes, which was not entirely unexpected given the limitations of the debiasing technique we employed.

The model generally performs as expected. Occasionally, however, a generated caption receives a higher offensive score after passing through the RoBERTa layer than before. This could be attributed to the upper bound we set on adding randomness, which prevents an infinite loop: captions regenerated with more randomness are not guaranteed to score lower, and might score higher.

Our project underwent several strategic adjustments. Initially, we intended to employ advanced debiasing methods such as DEAR and adversarial learning. However, due to the lack of clear implementation guidelines and the inherent complexities associated with these methods, we opted for a more practical approach using RoBERTa. While this adjustment allowed us to achieve timely results, it necessitated compromising on the depth of bias mitigation. In hindsight, integrating RoBERTa into our loss function rather than placing it as the final layer could have been more effective. This modification would have better leveraged RoBERTa's capabilities to minimize offensive scores, potentially yielding improved outcomes. Although we refrained from implementing this method due to its high computational demands, it remains a viable option for future exploration.

To enhance the project, we should consider data augmentation, as the current dataset of 300 meme templates may be insufficient for the model to capture relevant image features. Furthermore, we should explore the DEAR method to account for additional protected attributes, such as race, age, and gender, thereby ensuring that the model remains equitable. Additionally, we should investigate techniques that allow the model to learn and generate humorous captions, preserving the comedic essence of memes.

This project highlights several key aspects of model development, particularly in natural language processing. One major takeaway is the importance of dataset quality and diversity in training models, especially those intended for social purposes like meme generation. We also learned about the trade-offs between model complexity and practical implementation, which are crucial considerations in real-world applications. Moreover, the project underscored the significance of being flexible and adaptable in research and development. The ability to pivot and adapt to new information or constraints is invaluable and something we will carry forward into future projects.

REFERENCES

[1] https://drive.google.com/file/d/1j6YG3skamxA1-mdogC1kRjugFuOkHt_A/edit

[2] Groberta: Pre-trained embeddings for offensive ... (n.d.). https://www.cmu.edu/ideas-social-cybersecurity/events/conference-archive/archive-conference-2021/conference-papers-2021/conference-2021-paper-10-groberta.pdf

[3] cardiffnlp/twitter-roberta-base-offensive. Hugging Face. (n.d.). https://huggingface.co/cardiffnlp/twitter-roberta-base-offensive

[4] Seth, A., Hemani, M., & Agarwal, C. (2023). DeAR: Debiasing vision-language models with additive residuals. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/cvpr52729.2023.00659