In our machine learning class, our professor talked about Naïve Bayes classifiers and how well they classify spam, so we decided to implement one to cut down on the phishing emails we get at WSU.
What it does
We calculate the probability that a given email is spam and the probability that it is not spam, based on the words it contains, and classify the email as whichever is more probable. In addition, if the body of an email contains a DocuSign link, we check whether that link is authentic.
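The word-based comparison described above can be sketched roughly as follows. This is a minimal illustration and not our actual code: the function names (`train`, `score`, `classify`), the whitespace tokenization, the "ham" label for non-spam, and the add-one (Laplace) smoothing are all assumptions made for the example.

```python
from collections import Counter
from decimal import Decimal

def train(labeled_emails):
    """Count words per class. labeled_emails is a list of (text, label)
    pairs, where label is 'spam' or 'ham' (hypothetical label names)."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    email_counts = Counter()
    for text, label in labeled_emails:
        email_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, email_counts, vocab

def score(text, label, model):
    """Return P(label) * product of P(word | label) over the email's words."""
    word_counts, email_counts, vocab = model
    total_emails = sum(email_counts.values())
    total_words = sum(word_counts[label].values())
    # Prior: fraction of training emails with this label.
    prob = Decimal(email_counts[label]) / Decimal(total_emails)
    for word in text.lower().split():
        # Add-one smoothing (an assumption) so unseen words do not zero the product.
        prob *= Decimal(word_counts[label][word] + 1) / Decimal(total_words + len(vocab))
    return prob

def classify(text, model):
    """Pick whichever label has the higher score."""
    return max(("spam", "ham"), key=lambda label: score(text, label, model))

# Tiny made-up training set, for illustration only.
model = train([
    ("win free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting notes attached", "ham"),
    ("class schedule for monday", "ham"),
])
print(classify("free prize", model))
print(classify("meeting schedule", model))
```

Using `Decimal` for the running product keeps very small probabilities from collapsing to zero on long emails.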
How we built it
Since we wanted to build our model specifically to filter WSU spam, we used WSU emails as our data. We exported the emails from Outlook with a VBA script. From there, we used Python to convert the emails into a readable format, hand-label the data, split it into training and test sets, and finally build our model.
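Splitting the hand-labeled data into training and test sets could look something like the sketch below. The function name `split_dataset`, the 20% test fraction, and the fixed seed are assumptions for illustration; they are not the values we actually used.

```python
import random

def split_dataset(labeled_emails, test_fraction=0.2, seed=0):
    """Shuffle a copy of the labeled emails and split it into
    (training_set, test_set). The fraction and seed are illustrative."""
    shuffled = list(labeled_emails)
    # A seeded Random instance makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder data standing in for hand-labeled (text, label) pairs.
data = [(f"email {i}", "spam" if i % 2 else "ham") for i in range(10)]
training_set, test_set = split_dataset(data)
print(len(training_set), len(test_set))
```

Holding the test set out of training is what makes an accuracy figure measured on it meaningful.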
Challenges we ran into
Accomplishments that we're proud of
Our model reaches 96% accuracy, with zero non-spam emails marked as spam.
What we learned
We learned a lot of Python (the decimal module is very helpful when you have numbers on the order of e-100), and that encoding is fun.
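To show why decimal helps, here is a small sketch comparing a plain float product with a `Decimal` product. The per-word probability of 1e-4 and the email length of 100 words are made-up numbers chosen so the effect is visible; a Python float cannot represent values much below about 1e-308, so a long enough product of small probabilities underflows to zero.

```python
from decimal import Decimal

num_words = 100  # hypothetical email length

float_product = 1.0
decimal_product = Decimal(1)
for _ in range(num_words):
    # Hypothetical per-word probability of 1e-4, multiplied in both ways.
    float_product *= 1e-4
    decimal_product *= Decimal("1e-4")

print(float_product)    # the float has underflowed
print(decimal_product)  # the Decimal still holds the tiny value
```

Once both class scores underflow to zero as floats, they can no longer be compared, which is the problem `Decimal` avoids here.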
What's next for Project name
We hope to extend our model into an Outlook extension. From there, we could flag spam directly in the inbox and improve our detection of phishing attempts made through services like DocuSign, since we would have access to more information (subject line, sender, etc.).