Inspiration
Originally, we were interested in learning more about neural networks and using their pattern recognition to detect threats that we might miss as humans. After brainstorming, we realized that a phishing email detector would be a perfect fit, since it can learn from a large dataset of phishing and legitimate emails to detect fraud.
What it does
Through a web application, users sign in with their personal Gmail account and allow our program to analyze the legitimacy of their most recent X emails without storing any of their information in a database. If our model finds a suspicious email, the user is alerted to be cautious. Currently, our model achieves 95% accuracy on unseen data.
How we built it
We trained a neural network using Google Colab and Kaggle. First, we processed the data from the .csv file into a readable format by stripping unreadable characters and removing stop words. Removing stop words helps the model focus on the words and phrases that carry more meaning. For example, we want greater emphasis on "free", "credit card", and "prince" than on "the", "with", and "him". Next, the program tokenizes each email and maps each token to a 128-dimensional vector. This step is important because it lets synonyms and similar words end up with similar representations. We then designed a model architecture suited to natural language processing (NLP) and began training on our data. Once the model was trained, we created a testing loop to measure its accuracy.
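The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not our exact pipeline: the stop-word list is deliberately tiny, and the helper names (`clean_text`, `build_vocab`, `encode`) and the `max_len` parameter are assumptions made for the example. The integer IDs it produces are the kind of input an embedding layer would then map to 128-dimensional vectors.

```python
import re

# Tiny illustrative stop-word list (assumption); a real pipeline would use a fuller one.
STOP_WORDS = {"the", "with", "him", "a", "an", "and", "to", "of", "is", "in"}

def clean_text(text: str) -> list[str]:
    """Lowercase, replace anything that is not a-z or whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return [word for word in text.split() if word not in STOP_WORDS]

def build_vocab(token_lists) -> dict[str, int]:
    """Give each distinct word an integer ID; 0 is padding, 1 is unknown."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in token_lists:
        for word in tokens:
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(tokens, vocab, max_len=8) -> list[int]:
    """Convert tokens to IDs, then truncate or pad to a fixed length."""
    ids = [vocab.get(word, 1) for word in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))

tokens = clean_text("Claim the FREE credit card with him!")
vocab = build_vocab([tokens])
encoded = encode(clean_text("free prince"), vocab, max_len=4)
```

Words that never appeared in the training data (here, "prince") fall back to the unknown ID, which is one reason a large and varied dataset matters.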
Next, we needed a way to integrate our trained model into a more readable and interactive user interface. We built an application around the model, with React as the front end and Flask as the back end. It works by logging in to your email account, using the Gmail API to fetch your most recent X emails, and finally running inference with our trained model on the email bodies to assess their legitimacy.
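The inference step of that flow might look something like the sketch below. The function name `flag_suspicious`, the `threshold` parameter, and the stand-in scorer `keyword_score` are all hypothetical; in the real application the scorer would be the trained model, and the bodies would come from the Gmail API. Everything stays in memory, so nothing is written to a database.

```python
from typing import Callable, Iterable

def flag_suspicious(
    bodies: Iterable[str],
    predict: Callable[[str], float],
    threshold: float = 0.5,
) -> list[int]:
    """Return the indices of emails whose phishing score meets the threshold."""
    return [i for i, body in enumerate(bodies) if predict(body) >= threshold]

def keyword_score(body: str) -> float:
    """Hypothetical stand-in for the trained model, used only for this example."""
    return 0.9 if "prince" in body.lower() else 0.1

inbox = [
    "Your hackathon submission was received.",
    "A wealthy prince needs your credit card details.",
]
flagged = flag_suspicious(inbox, keyword_score)
```

Passing the scorer in as an argument keeps the web layer independent of the model, so the model can be retrained or swapped without touching the route code.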
Challenges we ran into
It was difficult to decide which model architecture to use, since NLP was fairly new to us. We also had trouble processing the data, since some rows contained invalid characters that originally broke our parser. In addition, the Gmail API originally returned only snippets of the messages in our inbox, which led to inconclusive results because the text was too short.
Accomplishments that we're proud of
In no particular order, we were proud of several aspects of our program:
- Our final design doesn't store any emails, since storing them would pose an even greater security risk.
- Our model was far more accurate than we anticipated. It even correctly identified an email from DevPost as legitimate and one from a "wealthy British investor" as phishing.
- It connects directly to your inbox, so there's no need to copy and paste every email you receive. Simply log in and view.
- Although machine learning is fairly complicated, we were proud that we understood each step of our program, or at least had a high-level understanding of how it worked.
What we learned
We learned the following:
- Processing and preparing unorganized data from Kaggle
- Using React and Flask for full stack development
- Training a neural network using Google Colab and understanding the underlying concepts
What's next for our project
Now that we can identify phishing and legitimate emails, we can analyze the results and detect deeper trends. For example, we could blacklist senders whose emails have consistently been flagged as phishing or spam.
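As a rough sketch of how that blacklisting idea could work, the snippet below counts how often each sender has been flagged and returns those at or above a cutoff. Nothing here is implemented yet: the function name `senders_to_blacklist`, the `min_flags` cutoff of 3, and the sample addresses are all assumptions for illustration.

```python
from collections import Counter

def senders_to_blacklist(flagged_senders, min_flags=3):
    """Return senders flagged at least min_flags times, sorted alphabetically."""
    counts = Counter(sender.strip().lower() for sender in flagged_senders)
    return sorted(sender for sender, n in counts.items() if n >= min_flags)

# Made-up sender addresses, one entry per flagged email.
flag_log = [
    "investor@example.com",
    "Investor@example.com ",
    "investor@example.com",
    "newsletter@example.org",
]
blacklist = senders_to_blacklist(flag_log)
```

Note that this would mean keeping a list of flagged sender addresses, which has to be weighed against our decision not to store email data.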