
Reflection #1

Marina Triebenbacher (mtrieben), Rachel Yan (ry22), Zachary Mothner (zmothner)

Introduction
In this project we will implement a deep learning algorithm to predict Twitter users' political affiliations from a single tweet. We will use a GRU model architecture based on the one described in the Stanford University paper Predicting U.S. Political Party Affiliation on Twitter. The paper's authors released a code repository containing a model built on the TensorFlow framework; we will reimplement the model in PyTorch. The paper also used the Tweet Congress dataset for its training and testing data, but we will use the Twitter Politicians dataset instead. This dataset includes politicians from countries outside the United States, which means that if we can build a model that successfully classifies U.S. political parties, we can attempt to apply it to other countries' political systems.
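As a rough starting point, a minimal PyTorch sketch of the kind of GRU classifier we have in mind might look like the following. The class name, layer sizes, and vocabulary size are our own placeholders, not values taken from the paper:

```python
import torch
import torch.nn as nn

class TweetGRU(nn.Module):
    """Sketch of a GRU-based tweet classifier (hypothetical sizes/names)."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) tensor of integer token IDs
        embedded = self.embedding(token_ids)
        _, hidden = self.gru(embedded)     # hidden: (1, batch, hidden_dim)
        return self.fc(hidden.squeeze(0))  # logits: (batch, num_classes)
```

A single tweet would be tokenized into integer IDs, padded, and passed through the model to produce per-party logits.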

Challenges
The hardest part of the project so far has been downloading and filtering the data. To interact with the Twitter Politicians dataset, we had to hydrate all of the tweets using Twitter's API, i.e., convert them from a bare list of Tweet IDs into JSON containing the full text and other relevant fields. Although we plan to use only the American politicians from the dataset, there is no way to filter by country given just a list of Tweet IDs, so hydrating the entire dataset took a considerable amount of time, and the resulting JSON takes up more than 20GB.
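Part of why hydration is slow is that Twitter's tweet-lookup endpoint accepts at most 100 IDs per request, so the full ID list has to be processed in batches. A small helper like this (our own sketch, independent of any particular Twitter client library) captures the batching step:

```python
def chunk_ids(tweet_ids, batch_size=100):
    """Split a flat list of Tweet IDs into batches.

    Twitter's tweet-lookup endpoint accepts up to 100 IDs per call,
    so hydration has to iterate over batches like these.
    """
    return [tweet_ids[i:i + batch_size]
            for i in range(0, len(tweet_ids), batch_size)]
```

Each batch would then be passed to whichever hydration client is used, and the returned JSON appended to the output file.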

Surprisingly, building the model architecture in PyTorch was not especially difficult, even though none of the three of us had used PyTorch before. However, we have yet to test our model on actual data, so we are sure there are many challenges still to come.

Insights
We have begun preprocessing a subset of our data, which we have successfully fetched from the Twitter API, classified, and reformatted into a dataframe. We are currently running our script to fetch the rest of the data, which is taking many hours since there are over a million tweets; once that completes, we will have a full dataset to run through our model. We haven't tested the model yet, so we aren't sure of its performance, but we have implemented it in PyTorch and hope to train and test it as soon as all of our data is compiled.
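The reformatting step essentially pairs each hydrated tweet with its author's party label. A simplified pandas sketch, where the field names (`user_id`, `full_text`) and the `party_by_user` mapping are illustrative rather than our exact schema:

```python
import pandas as pd

def build_dataframe(tweets, party_by_user):
    """Pair hydrated tweets with party labels in a dataframe.

    tweets: list of dicts with 'user_id' and 'full_text' keys (illustrative).
    party_by_user: hypothetical mapping from user ID to party label.
    """
    rows = [
        {"text": t["full_text"], "party": party_by_user[t["user_id"]]}
        for t in tweets
        if t["user_id"] in party_by_user  # keep only users we can label
    ]
    return pd.DataFrame(rows, columns=["text", "party"])
```

Filtering on the label mapping also drops the non-U.S. politicians we aren't using yet.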

Plan

Are you on track with your project?
Yes, we are on track with our project. While the preprocessing code was more time-consuming and complicated than expected, we completed it and tested it on a smaller chunk of the data, so now we are just waiting for it to run on the larger dataset.

What do you need to dedicate more time to?
We definitely need to dedicate more time to the model from here on out, but we expect to run the model within the next couple of days and the remainder of our time will be dedicated to tuning hyperparameters and testing the model.
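One simple way we could organize the tuning phase is a grid search over a small search space. A sketch of the enumeration step, where the specific hyperparameters and ranges are placeholders, not choices we have committed to:

```python
from itertools import product

def hyperparameter_grid(search_space):
    """Enumerate every combination of hyperparameter values in search_space."""
    keys = list(search_space)
    return [dict(zip(keys, combo))
            for combo in product(*(search_space[k] for k in keys))]
```

Each resulting configuration would then be used to train and evaluate one model run.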

What are you thinking of changing, if anything?
As of now, there is nothing we are planning on changing.
