Predicting Who Sent Hillary Clinton Emails

I wanted to play with some data so I downloaded Hillary's emails from Kaggle. I also wanted to play with Spotify's Annoy.

I framed the task of predicting who emailed Hillary as a machine learning classification problem. I kept only the emails that were sent to "H", a codename for Hillary, and then took all the emails from the 15 people who emailed her the most. I used the body of each email as the features and attempted to predict the name of the sender. A random guess would be right 1/15 of the time, or about 7%.
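The filtering step above can be sketched with pandas. This is a minimal illustration on a toy table, not the original code: the column names `sender`, `recipient`, and `body` and the helper `top_sender_emails` are assumptions made for the example, and the real Kaggle file uses its own column names.

```python
import pandas as pd

# Toy stand-in for the email table; column names are hypothetical.
emails = pd.DataFrame({
    "sender": ["alice", "alice", "bob", "carol", "alice", "bob", "dave"],
    "recipient": ["H", "H", "H", "H", "someone else", "H", "H"],
    "body": ["call me", "see attached", "schedule update", "fyi",
             "not for H", "draft speech", "press clips"],
})

def top_sender_emails(df, recipient="H", n_senders=15):
    """Keep emails sent to `recipient` by the `n_senders` most frequent senders."""
    to_recipient = df[df["recipient"] == recipient]
    top = to_recipient["sender"].value_counts().head(n_senders).index
    return to_recipient[to_recipient["sender"].isin(top)]

# With the toy data, keep only the 2 most frequent senders.
subset = top_sender_emails(emails, n_senders=2)

# Accuracy expected from guessing uniformly among 15 senders.
random_baseline = 1 / 15
```

On the real data, `n_senders=15` would reproduce the 15-class setup described above.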

I represented the textual data using tf-idf, one of the most popular ways to represent text. I then used three different classifiers to predict who sent each email. The classifiers I used were:

Naive Bayes
Stochastic Gradient Descent Classifier
k-Nearest Neighbors - implemented with Annoy (Approximate Nearest Neighbors Oh Yeah!)
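As a rough sketch of the tf-idf representation and the first two classifiers, here is how they could be wired together in sklearn. The toy `bodies` and `senders` lists are made up for illustration, and the settings shown (English stop words, default hyperparameters) are assumptions rather than the ones used in this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB

# Toy email bodies and sender labels, invented for the example.
bodies = [
    "meeting schedule tomorrow morning",
    "schedule changed, meeting moved",
    "draft speech attached for review",
    "revised speech draft attached",
]
senders = ["alice", "alice", "bob", "bob"]

# Turn raw text into a sparse tf-idf matrix, dropping common English words.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(bodies)

# Fit each classifier on the same features and collect its predictions.
models = {
    "naive_bayes": MultinomialNB(),
    "sgd": SGDClassifier(random_state=0),
}
predictions = {}
for name, model in models.items():
    model.fit(X, senders)
    predictions[name] = model.predict(X)
```

In a real run the predictions would be made on held-out emails rather than on the training set as done here.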

I was able to achieve over 60% accuracy, which is much better than a random guess, but I did run into some challenges.

high-dimensionality
- although I was just using text information as the features, I still had over 25k features (even after removing common words like "like", "the", and "and")
- to combat this, I used Random Projections, which not only reduced the time it took to train and predict, but also in some cases improved the accuracy
unbalanced dataset
- there were 3 people that emailed "H" more than everyone else combined. This is a problem because classifiers tend to learn that only these 3 people are important and ignore the rest of the classes (or people). Although this can help raw accuracy, I wanted to try to learn each class equally
- to combat this, I built a balanced version of the dataset. This hurt the accuracy across all classifiers, but it gives a fairer picture of how well each sender is predicted.
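Both fixes can be sketched in a few lines. This is an illustration under assumptions, not the original code: it uses sklearn's `SparseRandomProjection` (the write-up does not say which random projection variant was used), picks `n_components=50` arbitrarily, and balances by downsampling every sender to the size of the smallest class, which is only one possible way to build a balanced dataset. The data is random and stands in for the tf-idf matrix.

```python
import numpy as np
import pandas as pd
from sklearn.random_projection import SparseRandomProjection

# Random stand-in for a wide feature matrix: 30 emails, 500 features.
rng = np.random.default_rng(0)
X = rng.random((30, 500))

# Project the features down to 50 dimensions.
projector = SparseRandomProjection(n_components=50, random_state=0)
X_reduced = projector.fit_transform(X)

# Unbalanced labels: one sender dominates.
labels = pd.DataFrame({"sender": ["a"] * 20 + ["b"] * 7 + ["c"] * 3})

# Downsample every sender to the size of the smallest class.
smallest = labels["sender"].value_counts().min()
balanced = labels.groupby("sender").sample(n=smallest, random_state=0)

# Select the matching rows of the reduced feature matrix.
X_balanced = X_reduced[balanced.index.to_numpy()]
y_balanced = balanced["sender"]
```

Downsampling throws away data from the frequent senders, which is consistent with the drop in accuracy noted above.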

Overall, stochastic gradient descent (SGD) was the fastest and most accurate classifier using all the tf-idf features; k-nearest neighbors (kNN) was more accurate but a bit slower than SGD on the reduced-dimension dataset; and kNN did the best in terms of both speed and accuracy on the balanced dataset.

Built With

pandas
python
sklearn

Updates

Ben Lawson started this project — Apr 30, 2017 02:46 AM EDT
