Fishing the Phish

Inspiration

The world has become increasingly dependent on the internet for most everyday tasks. Whether it is browsing for homework help, finding a recipe, shopping, or carrying out critical banking transactions, the internet has become an indispensable part of daily life. It is therefore no surprise that institutions implement various security measures to prevent and mitigate cyber-security attacks. Phishing is one such attack. In a phishing attack, an attacker tries to obtain sensitive customer information by masquerading as a legitimate website. One real-world example is the large volume of phishing emails sent during the football World Cup, which promised tickets to Moscow in order to extract recipients' personal information. Besides the danger it poses to the user, phishing also tarnishes a company's brand image and spreads distrust in its application or service. With stakes this high, it is important to limit phishing attacks and to have a methodical approach for identifying them in order to build robust systems.

This problem can be addressed efficiently with machine learning and data science techniques. Various features can be extracted from a website's URL to distinguish a legitimate website from a phishing website.

What it does

Detects phishing websites from their URLs and lifetime details (creation, expiration, and last update dates) using machine learning models.

How we built it

For our project, we extract a feature set consisting of the length of the URL, the number of slashes in the URL, the number of words in the host name, and the number of dots in the URL, along with parameters such as the URL's creation date, expiration date, and last update date.
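The feature set described above could be computed along the following lines. This is a minimal sketch, not the project's actual code: the function names `extract_url_features` and `lifetime_features`, the dictionary keys, the choice to split host-name words on dots and hyphens, and the idea of turning the three dates into day counts are all assumptions made for illustration. The dates are assumed to be already available as `datetime.date` values; how they are looked up is not covered here.

```python
import re
from datetime import date
from urllib.parse import urlparse


def extract_url_features(url):
    """Lexical features taken directly from the URL string."""
    host = urlparse(url).hostname or ""
    # Assumption: a "word" in the host name is any non-empty token
    # separated by a dot or a hyphen.
    host_words = [w for w in re.split(r"[.\-]", host) if w]
    return {
        "url_length": len(url),
        "num_slashes": url.count("/"),
        "num_host_words": len(host_words),
        "num_dots": url.count("."),
    }


def lifetime_features(created, expires, updated, today):
    """Turn the three lifetime dates into numeric day counts (hypothetical encoding)."""
    return {
        "age_days": (today - created).days,
        "days_to_expiry": (expires - today).days,
        "days_since_update": (today - updated).days,
    }


features = extract_url_features("http://login.example-bank.com/secure/update.html")
features.update(
    lifetime_features(
        created=date(2019, 9, 1),
        expires=date(2020, 9, 1),
        updated=date(2019, 9, 10),
        today=date(2019, 9, 22),
    )
)
print(features)
```

Very recently created domains with a short remaining lifetime are one plausible reason the date parameters carry signal, which is why the sketch encodes them as day counts relative to a reference date.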

We divided the features into two sets, one with numerical values and one extracted directly from the URL text, and applied a variety of machine learning models in our analysis. The models include Logistic Regression, K-Nearest Neighbors, Decision Trees, Random Forest, Adaptive Boosting, and Long Short-Term Memory networks. We measure performance with the F1-score, which remains informative for classification problems even when class sizes are imbalanced and strikes a good balance between precision and recall. Instead of choosing one learning model up front, we let our algorithm pick the optimal model for each feature set based on its F1-score. Finally, we combined the models built from the two feature sets, which resulted in a model with much better phishing detection capability.
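The selection-and-fusion idea can be illustrated with a small, dependency-free sketch. The F1-score is the harmonic mean of precision and recall, and the best model for a feature set is simply the one with the highest F1 on held-out labels. The function names, the toy predictions, and in particular the fusion rule (averaging the two selected models' phishing probabilities and thresholding at 0.5) are assumptions for illustration; the write-up does not state how the two models were actually combined.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (1 = phishing, 0 = legitimate)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_best_model(y_true, predictions):
    """Return the name of the model with the highest F1, plus all scores."""
    scores = {name: f1_score(y_true, preds) for name, preds in predictions.items()}
    best = max(scores, key=scores.get)
    return best, scores


def fuse_probabilities(p_numeric, p_text, threshold=0.5):
    """Hypothetical fusion rule: average the two models' probabilities."""
    return [1 if (a + b) / 2 >= threshold else 0 for a, b in zip(p_numeric, p_text)]


# Toy validation labels and predictions from two hypothetical candidate models.
y_val = [1, 1, 0, 0, 1]
candidates = {
    "logistic_regression": [1, 0, 0, 0, 1],
    "random_forest": [1, 1, 1, 1, 1],
}
best_name, all_scores = select_best_model(y_val, candidates)
print(best_name, all_scores)

# Fuse the probabilities of the best numeric-feature and text-feature models.
print(fuse_probabilities([0.9, 0.2], [0.7, 0.4]))
```

In practice the same selection step would run once per feature set, and the two winners would then be passed to the fusion step.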

Challenges we ran into

Feature engineering based on URLs, and combining multiple ML models trained on different features to solve the same problem.

Accomplishments that we're proud of

Dynamically selecting the best model among many ML models, and fusing models trained on different sets of features to produce better results than the individual models.

What we learned

Solving a real-world data science problem, feature engineering, hyper-parameter tuning, and model fusion.

What's next for Fish the Phish

Engineer more relevant features and train on a much larger number of samples to improve the accuracy of our model.

Built With

Submitted to

ShellHacks 2019
- Winner Akamai - Security Challenge

Created by

Exploratory data analysis, Feature engineering, Machine learning and deep learning model selection, hyper-parameter tuning for the selection of optimal model, Model fusion for improving Phishing detection capabilities using Python, Keras, scikit-learn, pandas and Seaborn.

Keerthiraj Nagaraj
Doctoral Candidate in the department of Electrical and Computer Engineering at University of Florida
Research for various models.
Pair programming and debugging.

kunal4892
Understanding the dataset and applying the principles of association rule mining for extracting features.
Literature survey for forming the datasets.
Forming the URL based feature set.
Forming project description.

Noopur R Kalawatia

Updates

Keerthiraj Nagaraj started this project — Sep 22, 2019 04:28 AM EDT
