Fraud Email Detector - IM16

Inspiration

In today's digital age, there's a growing problem with fraudulent emails, particularly phishing attacks. These deceitful emails aim to trick people into revealing sensitive information like passwords, financial data, and personal details. This poses a significant risk to individuals and businesses, as evidenced by the fact that 83% of UK businesses hit by cyberattacks in 2022 reported falling victim to phishing.

Detecting and stopping these fraudulent emails is crucial for protecting email users' security and privacy. Using machine learning and natural language processing, we aim to develop a practical solution to effectively combat this issue and make the digital world a safer place.

What it does

Our Fraud Email Detector runs an NLP-based machine learning model that reads the contents of a user's email and classifies it as fraudulent or not. The trained model reached 99% accuracy on our dataset.

How we built it

We built the Fraud Email Detector on IBM z/LinuxONE systems, which we used to process and analyse our dataset and feed it into our machine learning algorithm. The dataset consists of almost 12,000 entries, each containing the content of an email and a class label indicating whether the email is fraudulent. We applied NLP techniques, including tokenisation, stopword removal and stemming, to clean and prepare the data for the TF-IDF algorithm, which weights each word or phrase by how important it is to an email relative to all of the emails. Once these weights had been computed for every entry, we used the transformed data to train our machine learning algorithm, a Random Forest Classifier. Finally, we built a website as an interface to the model using Parcel, a build tool for HTML, CSS and JavaScript.
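The preprocessing and TF-IDF steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not our actual code: the stopword list, the `naive_stem` helper, the function names and the sample emails are all assumptions made for the example, and a real pipeline would more likely rely on an NLP library for stemming and vectorisation. The Random Forest training step is left out of the sketch.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (assumption; a real project would use a fuller one).
STOPWORDS = {"a", "an", "the", "to", "is", "your", "you", "and",
             "of", "for", "on", "at", "has", "been"}


def naive_stem(word):
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenise, drop stopwords, then stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


def tfidf(documents):
    """Return one {term: weight} dict per document.

    weight = (term count / tokens in document) * log(N / documents containing term)
    """
    tokenised = [preprocess(d) for d in documents]
    n_docs = len(tokenised)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenised:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical sample emails, only to show the call shape.
emails = [
    "Verify your account password now",
    "Meeting notes attached for your review",
    "Your account has been suspended, verify now",
]
for vector in tfidf(emails):
    # Print the three highest-weighted terms of each email.
    print(sorted(vector, key=vector.get, reverse=True)[:3])
```

With this weighting, a term that appears in every email gets a weight of zero, while terms concentrated in a few emails score higher, which is what makes the resulting vectors useful as classifier input.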

Challenges we ran into

The biggest challenge was working out the best way to process the dataset so that it could be fed into a machine learning algorithm. We considered turning the data into features such as sentiment values, but during our research we came across the TF-IDF algorithm, which we felt fit our use case best.

Accomplishments that we're proud of

None of us had any prior knowledge of NLP, so we are very proud that we finished a project within a day in an area we were not familiar with. We are also proud that our ML model achieved an accuracy of 99%.

What we learned

As mentioned earlier, none of us had any knowledge of NLP, so this was the main thing we learned about. This included NLP concepts such as tokenisation, n-grams, lemmatisation, stemming, sentiment analysis and POS tagging.

What's next for Fraud Email Detector - IM16

In the future, we aim to migrate the website to a Google Chrome extension to make it more accessible to users. The extension would automatically read an email's contents and indicate to the user whether the algorithm thinks it is fraudulent. To improve the algorithm's learning, we would introduce a feedback system that lets users mark an email as fraudulent or not, so that it can be used as further training data for the model. Finally, we would try applying deep learning to a much bigger dataset to maximise the accuracy of the algorithm.
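As a rough idea of how the feedback system could collect user corrections, here is a minimal sketch. Nothing in it exists yet: `FeedbackStore`, `mark` and `as_training_data` are hypothetical names, and we assume labels are stored as 1 for fraudulent and 0 for legitimate.

```python
from dataclasses import dataclass, field


@dataclass
class FeedbackStore:
    """Hypothetical in-memory store of user-marked emails."""

    # Maps email text to its latest label (1 = fraudulent, 0 = legitimate).
    labels: dict = field(default_factory=dict)

    def mark(self, email_text: str, is_fraud: bool) -> None:
        """Record a user's verdict; a later verdict on the same email overwrites it."""
        self.labels[email_text] = int(is_fraud)

    def as_training_data(self):
        """Return parallel lists of texts and labels, ready to append to a training set."""
        texts = list(self.labels.keys())
        labels = list(self.labels.values())
        return texts, labels


# Hypothetical usage: two users mark emails, one verdict is later corrected.
store = FeedbackStore()
store.mark("Claim your prize today", True)
store.mark("Lunch on Friday?", False)
store.mark("Claim your prize today", True)
extra_texts, extra_labels = store.as_training_data()
```

Keying the store by email text means each email contributes one example, so a single email marked many times would not dominate retraining. A real system would also need to guard against malicious or mistaken feedback before retraining on it.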