We took it upon ourselves to try and improve data security using machine learning. Phishing attacks are one of the most common kinds of cyber-crime and fraud, so it is imperative that there are tools out there to help the average person determine if an email is to be trusted or not.
What it does
This program uses a web interface to allow users to either log in to an email account and select emails, or upload email files manually, and it simply determines if the given email is trustworthy or not.
How we built it
For organisation, we split the workload and brainstormed ideas using Trello: an online post board where you can add notes and ideas as a team. This helped us decompose the whole web application into manageable pieces and ensure we all knew what each other were doing and how our progress was getting on.
To make the email classifier we explored data sets on Google's Kaggle and loaded an appropriate one into Python. Using scikit-learn we processed the data and fitted a support vector machine to it so that it can learn which emails are spam and which are not. We chose the best parameters for the most optimal training session, and used the library pickle to finalise the model. This meant that although training took over 10 minutes, the pickled version ran in seconds on the actual website making it appropriate for use in a proper implementation. These functions were then ran in flask.
Challenges we ran into
Different data models for the ML aspect of the project had varying levels of effectiveness; we had to find and implement one that fit our purpose. Being inexperienced with using Git in a team project, we had to deal with many merge conflicts. Centering a box using CSS was probably the hardest part.
Accomplishments that we're proud of
We made a robust web-based interface; and actually managed to get a working final product after many hours of only having separate WIP parts.
What we learned
We are now a lot more familiar with the Python framework Flask, as well as email data and formats; not to mention learning about the ML model we used.
What's next for ChainMail
In terms of the web application, we would like to improve the appearance of the web application by considering devices with smaller widths and the general aesthetic of the template. Also, one of our initial ideas posted on our Trello board was to generate interesting statistics by parsing through the user's inbox. For example, most repeated topics and words.
In terms of the ML model, we would train the model using more data and improve using inputted user data. Also, we would like to experiment and compare different models. Another idea on the Trello board was also for the model to generate new spam / valid emails given a topic to observe what interesting emails the model would return.