At our university, phishing emails are a big problem. Many students accidentally give scammers access to their email accounts. We saw an opportunity to use machine learning to learn the way you write emails. If our program decides that a message is spam, i.e. not written by you, we ask for a second form of identification.
What it does
Given the text of an email as a string, our code compares it to our database of email attributes to determine whether the email was actually written by you or is spam.
How we built it
My teammate and I used the scikit-learn Python library to incorporate machine learning into our hack. We pulled our emails using the Gmail API and OAuth.
Challenges we ran into
The Gmail API sends email bodies back as base64url-encoded strings. None of the decoders we tried worked as expected, and when we found one that semi-worked, the output was HTML. The HTML markup then had to be stripped off, and since HTML is a context-free language rather than a regular one, simple pattern matching cannot handle it reliably, so stripping it proved to be quite difficult.
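For anyone hitting the same wall, here is a minimal sketch of the decode-then-strip step using only the Python standard library. It is not the code we ran during the hack. The helper names `decode_base64url` and `html_to_text` are made up for illustration, and the sketch assumes the body is UTF-8 and that the encoded string may arrive without its trailing `=` padding, which is a common reason decoders fail.

```python
import base64
from html.parser import HTMLParser


def decode_base64url(data: str) -> str:
    """Decode a base64url string, restoring any '=' padding that was stripped."""
    padded = data + "=" * (-len(data) % 4)
    # Assumption: the body is UTF-8; undecodable bytes are replaced, not fatal.
    return base64.urlsafe_b64decode(padded).decode("utf-8", errors="replace")


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Join the text nodes with spaces and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document using a real parser."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `html_to_text(decode_base64url(encoded_body))` gives plain text that can be fed to a classifier. Using a parser instead of regular expressions sidesteps the nesting problem, at the cost of inserting a space wherever a tag splits a word.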
Being new to machine learning, we had some trouble getting our program to run in a reasonable amount of time. We also initially tried a decision tree for classification, but because we had more ham (not spam) examples than spam examples, we got ham predictions no matter how spammy an email we gave it. After that we tried an SVM with an RBF kernel and got better results.
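A rough sketch of what an RBF SVM text classifier can look like in scikit-learn is below. Several things here are assumptions rather than a record of our hack: the TF-IDF features, the `class_weight="balanced"` setting (one standard way to counter a ham-heavy training set), and the toy emails, which are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy data invented for this example: more ham than spam, as in our case.
emails = [
    "Hey, are we still meeting for the study group tonight?",
    "Attached are my notes from today's lecture.",
    "Can you send me the lab report when you get a chance?",
    "Thanks for the help with the homework yesterday.",
    "Lunch at the dining hall at noon?",
    "URGENT: verify your account password now to avoid suspension",
    "You have won a prize, click this link to claim your reward",
]
labels = ["ham", "ham", "ham", "ham", "ham", "spam", "spam"]

# Assumption: TF-IDF word features; class_weight="balanced" reweights classes
# inversely to their frequency so the minority (spam) class is not ignored.
model = make_pipeline(
    TfidfVectorizer(),
    SVC(kernel="rbf", gamma="scale", class_weight="balanced"),
)
model.fit(emails, labels)

# Classify a new message; the result is an array with one label per input.
prediction = model.predict(["Click this link now to verify your password"])
print(prediction[0])
```

With a real inbox the training set would be much larger, and it would be worth checking precision and recall on the spam class rather than overall accuracy, since accuracy alone looks fine even when everything is predicted as ham.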