I wanted to play with some data so I downloaded Hillary's emails from Kaggle. I also wanted to play with Spotify's Annoy.

I framed the question of who emailed Hillary as a machine learning classification problem. I kept only the emails that were sent to "H", a codename for Hillary, and then took all the emails from the 15 people who emailed her the most. I used the body of each email as the features and attempted to predict the name of the sender. With 15 possible senders, a uniform random guess would be right 1/15 of the time, or about 7%.
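
The filtering step above can be sketched in a few lines. This is only an illustration on made-up records: the dictionary keys `to`, `sender`, and `body` and the function name `top_sender_subset` are hypothetical, not the column names of the Kaggle files.

```python
from collections import Counter

def top_sender_subset(emails, recipient="H", n_senders=15):
    """Keep emails sent to `recipient` by the `n_senders` most frequent senders.

    `emails` is a list of dicts with hypothetical keys "to", "sender", "body".
    Returns the filtered emails and the uniform random-guess accuracy.
    """
    to_recipient = [e for e in emails if e["to"] == recipient]
    counts = Counter(e["sender"] for e in to_recipient)
    keep = {sender for sender, _ in counts.most_common(n_senders)}
    subset = [e for e in to_recipient if e["sender"] in keep]
    baseline = 1.0 / len(keep) if keep else 0.0
    return subset, baseline

# Toy records, invented for this sketch.
emails = [
    {"to": "H", "sender": "alice", "body": "schedule for tomorrow"},
    {"to": "H", "sender": "alice", "body": "call sheet attached"},
    {"to": "H", "sender": "bob", "body": "draft statement"},
    {"to": "H", "sender": "bob", "body": "revised draft"},
    {"to": "H", "sender": "carol", "body": "fyi"},
    {"to": "X", "sender": "alice", "body": "not addressed to H"},
]
subset, baseline = top_sender_subset(emails, n_senders=2)
print(len(subset), baseline)
```

With 15 senders kept, the same `baseline` value works out to 1/15, the roughly 7% quoted above.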

I represented the textual data using tf-idf, one of the most popular ways to represent text. I then used three different classifiers to predict who sent each email. The classifiers I used were:

  1. Naive Bayes
  2. Stochastic Gradient Descent Classifier
  3. k-Nearest Neighbors - implemented with Annoy (Approximate Nearest Neighbors Oh Yeah!)
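
To make the tf-idf plus nearest-neighbor idea concrete, here is a minimal library-free sketch. It is not the code behind this project: it uses whitespace tokenization, one common smoothed idf formula, and an exact brute-force neighbor search, whereas Annoy builds an index for approximate search. All function names and the toy emails are assumptions made for the example.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn a smoothed inverse document frequency for every training term."""
    n_docs = len(docs)
    doc_freq = Counter(t for d in docs for t in set(d.lower().split()))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}

def tfidf(doc, idf):
    """Turn one document into a sparse, L2-normalized {term: weight} vector."""
    counts = Counter(t for t in doc.lower().split() if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def cosine(a, b):
    """Cosine similarity of two L2-normalized sparse vectors (a dot product)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def knn_predict(query_vec, train_vecs, train_labels, k=3):
    """Majority vote among the k most similar training vectors (exact search)."""
    ranked = sorted(range(len(train_vecs)),
                    key=lambda i: cosine(query_vec, train_vecs[i]),
                    reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy training emails and senders, invented for this sketch.
train_docs = [
    "please print the attached schedule",
    "schedule for tomorrow attached",
    "draft statement on libya",
    "libya statement revised draft",
]
train_labels = ["alice", "alice", "bob", "bob"]

idf = fit_idf(train_docs)
train_vecs = [tfidf(d, idf) for d in train_docs]
query_vec = tfidf("schedule attached tomorrow", idf)
prediction = knn_predict(query_vec, train_vecs, train_labels, k=3)
print(prediction)
```

The brute-force search compares the query against every training email, which is fine for a toy set but is exactly the cost an approximate index such as Annoy is meant to avoid.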

I was able to achieve over 60% accuracy, which is much better than a random guess. However, I did run into some challenges.

  • high-dimensionality

    • although I was just using text information as the features, I still had over 25k features (even after removing common words like "like", "the", and "and")
    • to combat this, I used Random Projections, which not only reduced the time it took to train and predict, but also in some cases improved the accuracy
  • unbalanced dataset

    • there were 3 people who emailed "H" more than everyone else combined. This is a problem because classifiers tend to learn that only these 3 people are important and ignore the rest of the classes (or people). Although this can be advantageous, I wanted to try to learn each class equally
    • to combat this, I built a balanced version of the dataset. This hurt the accuracy of all the classifiers, but it gives a fairer picture of how well each sender is learned.
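
Both fixes above can be sketched without any libraries. The write-up does not say which random projection variant was used or how the balanced dataset was built, so this sketch assumes a dense Gaussian projection and balancing by downsampling every class to the size of the smallest one. The function names and toy data are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

def gaussian_projection_matrix(n_features, n_components, seed=0):
    """Random matrix with N(0, 1/n_components) entries, one row per output dimension."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_components)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_features)]
            for _ in range(n_components)]

def project(vector, matrix):
    """Map a dense feature vector to the lower-dimensional space."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def balance_by_downsampling(samples, labels, seed=0):
    """Randomly keep the same number of samples for every label (assumed strategy)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    n_keep = min(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for label, group in by_label.items():
        for sample in rng.sample(group, n_keep):
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# Toy data: 10 samples with 8 features each, heavily skewed toward sender "a".
data_rng = random.Random(1)
samples = [[data_rng.random() for _ in range(8)] for _ in range(10)]
labels = ["a"] * 6 + ["b"] * 2 + ["c"] * 2

matrix = gaussian_projection_matrix(n_features=8, n_components=3)
reduced = [project(s, matrix) for s in samples]
bal_samples, bal_labels = balance_by_downsampling(reduced, labels)
print(len(reduced[0]), Counter(bal_labels))
```

Downsampling throws away most of the emails from the dominant senders, which is one plausible reason accuracy drops on the balanced set.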

Overall, stochastic gradient descent (SGD) was the fastest and most accurate classifier when using all the tf-idf features. On the reduced-dimension dataset, k-nearest neighbors (kNN) was more accurate than SGD but a bit slower. On the balanced dataset, kNN did the best in terms of both speed and accuracy.
