Introduction

We are planning to implement an existing paper’s computational linguistics model, which provides a framework for analyzing dehumanization in text data. The paper’s main objective is to computationally operationalize prior research on dehumanization, such as work showing that comparing a social group to vermin (like rats or parasites) dehumanizes that group. The authors operationalize four main components: negative valence toward the target group, denial of agency to the target group, association of moral disgust with the target group, and comparison of the target group to vermin. Using this framework, the authors were able to track how the term “gay” has become less dehumanizing over time, while the term “homosexual” remains strongly associated with dehumanizing language. The basic premise of the project is that the model trains word embeddings on a word-prediction task (word2vec predicts a word from its surrounding context words), and then evaluates dehumanization through cosine-similarity comparisons to concept vectors (such as a vector representing moral disgust). Our group was interested in this project because we wanted to do something related to linguistics as well as something that could be used for social good.
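To make the evaluation step concrete, here is a minimal sketch of the cosine-similarity idea. The vectors below are random stand-ins for learned embeddings, and the lexicon is a hypothetical placeholder, not the paper’s actual word lists:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Stand-ins for learned embeddings; in practice these would come from the
# trained model, e.g. model.wv["homosexual"] and vectors for a disgust lexicon.
rng = np.random.default_rng(0)
target_vec = rng.standard_normal(300)
disgust_vecs = rng.standard_normal((3, 300))  # one row per lexicon word

concept_vec = disgust_vecs.mean(axis=0)       # average "moral disgust" vector
print(cosine_similarity(target_vec, concept_vec))
```

A higher score would indicate that the target word sits closer to the concept in embedding space.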

Insights

We have received the data from the authors of the original paper and preprocessed the original TSV file by tokenizing the sentences. We were also able to split the file into smaller files based on the year each article was published. This should be all of the data preprocessing we need to do.
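A rough sketch of this preprocessing step, assuming a TSV with year and text columns (the column names, file names, and choice of NLTK’s tokenizer are illustrative, not necessarily what we used):

```python
import csv

from nltk.tokenize import word_tokenize  # assumes NLTK's punkt data is installed

def split_by_year(tsv_path: str) -> None:
    """Tokenize each article and append it to a per-year file, one sentence/article per line."""
    handles = {}
    with open(tsv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            year = row["year"]  # hypothetical column name
            tokens = word_tokenize(row["text"].lower())
            if year not in handles:
                handles[year] = open(f"nyt_{year}.txt", "a", encoding="utf-8")
            handles[year].write(" ".join(tokens) + "\n")
    for h in handles.values():
        h.close()
```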

We have also successfully set up the basic word2vec model architecture using Gensim’s word2vec implementation. The library has fairly robust functionality for training a word2vec model, saving it, inspecting the vector for a specific word, and calculating cosine similarities. We were able to train a word2vec model on the entire NYT data file, and by inspecting a few sample words (e.g., looking at the five most similar words to common words such as “American”), we found the results reasonable. We also wrote most of our analysis code, which performs weighted averaging over the word vectors in a given word category, e.g., words of moral disgust or vermin-related words.
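A sketch of what this setup looks like in Gensim (the file name, hyperparameters, and lexicon weights here are illustrative placeholders, not our final settings):

```python
import numpy as np
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Train on a tokenized, one-sentence-per-line file.
sentences = LineSentence("nyt_all.txt")  # hypothetical filename
model = Word2Vec(sentences, vector_size=300, window=5, min_count=10, workers=4)

model.save("nyt_word2vec.model")                   # persist for later fine-tuning
print(model.wv["American"][:10])                   # inspect a word vector
print(model.wv.most_similar("American", topn=5))   # sanity-check nearest neighbors

# Weighted average of vectors for a word category (e.g., moral disgust terms).
lexicon = {"disgust": 1.0, "filth": 0.5}           # placeholder words and weights
in_vocab = [w for w in lexicon if w in model.wv]
vecs = np.array([model.wv[w] for w in in_vocab])
weights = np.array([lexicon[w] for w in in_vocab])
category_vec = (vecs * weights[:, None]).sum(axis=0) / weights.sum()
```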

Challenges

One of the main challenges has actually been working with a 14 GB data file. We had to increase the size of our GCP VM instance to have enough storage space for the file, but we then found that our GCP machine type did not have enough RAM to hold the entire training set in memory before training starts. Our final solution was training the model on the CS department Grid, which (surprisingly) worked!
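For reference, one standard way to sidestep the RAM limit, had the Grid not worked out, is to stream the corpus from disk instead of loading it up front; Gensim accepts any restartable iterable of token lists (file name again hypothetical):

```python
from gensim.models import Word2Vec

class StreamingCorpus:
    """Yields one tokenized sentence at a time, so the 14 GB file never has to
    fit in memory. Gensim re-iterates it for the vocab scan and each epoch."""

    def __init__(self, path: str):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()  # file is already tokenized, space-separated

model = Word2Vec(StreamingCorpus("nyt_all.txt"), vector_size=300, workers=4)
```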

Plans

Our next step is to fine-tune our baseline model, which was trained on the entire NYT dataset, on each individual year’s data, producing a separate word2vec model per year, and to obtain word vectors for our target words across all of those years. Then we will apply the paper’s four main analyses to each year: valence of neighbors, dominance of neighbors, cosine similarity to words of moral disgust, and cosine similarity to vermin-related words.
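A minimal sketch of that fine-tuning loop using Gensim’s continued-training API, assuming the per-year files from our preprocessing step (the year range and file names are illustrative):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

for year in range(1986, 2016):                   # year range is illustrative
    model = Word2Vec.load("nyt_word2vec.model")  # fresh copy of the baseline each time
    corpus = LineSentence(f"nyt_{year}.txt")     # per-year file from preprocessing
    model.build_vocab(corpus, update=True)       # fold in words new to this year
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
    model.save(f"nyt_{year}.model")
```

Reloading the baseline inside the loop keeps each year’s model independent, so later years aren’t contaminated by earlier fine-tuning.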

We are hoping to meet our target goals within this week. There are two reasonable stretch goals we could work on next. First, we have access to a corpus of Reddit comments related to gay marriage from 2006 to 2017, which is interesting because it spans 2015, the year gay marriage was legalized in the United States. We could train our framework on the Reddit corpus and measure how dehumanizing Redditors’ language was over that period.

Second, we could use our existing model and word vectors trained on the NYT dataset to perform the same analyses on another set of target words (for example, those that refer to Muslim people). This would let us apply the same dehumanization framework to another group, and it would be particularly revealing because the NYT dataset covers several years before and after 2001, letting us measure rates of Islamophobic language across this critical period.

As a final note, because we rely on Gensim and faced few challenges around data collection, we are unsure whether we have done enough deep learning work for this project. We would appreciate any feedback on how to bulk up that portion of our project.

Built With

  • word2vec