Inspiration

As a college student with multiple email accounts, I receive many emails every day. A large majority of them are spam, and they quickly fill up my inbox. Deleting them is tedious, and over time it becomes difficult to tell which emails are top priority and which are spam.

What it does

The machine learning model reads a CSV file downloaded from Kaggle that contains more than 5,000 email messages, each labeled as spam or not spam. It converts the text of each message into numerical values using TF-IDF, so every word in a message receives a score, and a word that is repeated often in that message receives a higher score. Based on these scores, the model predicts whether a message is spam, outputting 0 if it is spam and 1 if it is not.
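To make the scoring idea concrete, here is a minimal sketch of textbook TF-IDF in plain Python. It is an illustration only: the three sample messages are made up, the `tfidf` helper is hypothetical, and the formula is the basic `tf * log(N / df)` form rather than the smoothed, normalized variant that scikit-learn's TfidfVectorizer applies by default.

```python
import math
from collections import Counter

def tfidf(messages):
    """Return one {word: score} dict per message using textbook TF-IDF."""
    tokenized = [m.lower().split() for m in messages]
    n_docs = len(tokenized)
    # Document frequency: the number of messages each word appears in.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            # Term frequency (share of the message) times inverse document frequency.
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

# Made-up sample messages, not taken from the Kaggle dataset.
messages = [
    "free prize claim your free prize now",
    "are we meeting for lunch now",
    "your lunch order is ready",
]
scores = tfidf(messages)

# "free" is repeated in the first message and appears nowhere else,
# so it scores higher than "now", which also appears in the second message.
print(scores[0]["free"], scores[0]["now"])
```

A word that appeared in every message would score zero under this formula, which is how TF-IDF keeps very common words from dominating.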

How I built it

First, I pre-processed the data so the text of each message could be converted into numerical values. Then I split the data into a training set and a testing set, each made up of two arrays. The X array held the email messages represented numerically, and the Y array held the labels as binary values: 0 (spam) and 1 (not spam). I then trained a logistic regression model on the training set, and it reached an accuracy of 96% on the training data.
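The steps above can be sketched roughly as follows with scikit-learn. This is not the project's exact code: the small in-memory dataset stands in for the Kaggle CSV, and the `test_size`, `random_state`, `lowercase`, and `stop_words` settings are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up stand-in for the Kaggle CSV. Labels follow the project's
# convention: 0 = spam, 1 = not spam.
messages = [
    "Win a free prize now, click here",
    "Claim your free cash reward today",
    "Exclusive offer: cheap loans approved instantly",
    "You have won a lottery, send your bank details",
    "Urgent: your account wins a free vacation",
    "Free entry to win concert tickets, text now",
    "Are we still meeting for lunch tomorrow?",
    "Please review the attached lecture notes",
    "The study group moved to the library",
    "Can you send me the homework solutions?",
    "Your professor posted the exam schedule",
    "Dinner at my place on Friday evening",
]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# Two groups of two arrays: X/Y for training and X/Y for testing.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=3, stratify=labels
)

# Fit the vectorizer on the training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X_train_features = vectorizer.fit_transform(X_train)
X_test_features = vectorizer.transform(X_test)

model = LogisticRegression()
model.fit(X_train_features, y_train)

train_accuracy = accuracy_score(y_train, model.predict(X_train_features))
test_accuracy = accuracy_score(y_test, model.predict(X_test_features))
print(f"training accuracy: {train_accuracy:.2f}")
print(f"testing accuracy:  {test_accuracy:.2f}")
```

Comparing the training and testing accuracy side by side shows whether the model generalizes to messages it has not seen, since the training figure alone can be optimistic.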

Challenges I ran into

The hardest part was converting the email text into numerical values that could be fed into the logistic regression model. I also wasn't very familiar with many of the libraries used in this project, so finding the right ones for the training and testing parts of the model was challenging.

Accomplishments that I'm proud of

I am proud of the model's high accuracy. An accuracy above 75% is often cited as a rule-of-thumb threshold for a good result in machine learning, so I am glad the model scored well above that.

What I learned

I learned about the most commonly used libraries for machine learning models. I also learned how to rearrange data so it can be fed into a model. Finally, I became familiar with several scikit-learn tools in Python, including the TfidfVectorizer and LogisticRegression classes.

What's next for Machine Learning Model: Spam or Not?

I would like to expand my machine learning model by automating it. Once the model decides whether an email is spam, it would delete the email from the inbox if it is spam and keep it if it is not.
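One possible shape for that automation is sketched below. Everything here is hypothetical: `triage_inbox` is an illustrative helper, `keyword_stub` is a placeholder standing in for the trained model, and the sketch only sorts messages into two lists rather than connecting to a real mailbox.

```python
SPAM, NOT_SPAM = 0, 1  # label convention used in this project

def triage_inbox(emails, predict):
    """Split emails into (keep, delete) lists using a predict(text) -> 0 or 1 callable."""
    keep, delete = [], []
    for email in emails:
        if predict(email) == SPAM:
            delete.append(email)
        else:
            keep.append(email)
    return keep, delete

def keyword_stub(text):
    # Hypothetical placeholder for the trained model's prediction.
    return SPAM if "free prize" in text.lower() else NOT_SPAM

inbox = ["Claim your FREE PRIZE today", "Lecture moved to room 204"]
keep, delete = triage_inbox(inbox, keyword_stub)
print("keep:", keep)
print("delete:", delete)
```

Passing the predictor in as a function keeps the sorting logic separate from the model, so the placeholder could later be swapped for a call to the real classifier.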