Fishing the Phish
The world has become increasingly dependent on the internet for most everyday tasks. Whether it is browsing for homework help, finding a recipe, shopping, or carrying out critical banking transactions, the internet has become an indispensable part of daily life. It therefore comes as no surprise that institutions implement various security measures to mitigate and prevent cyber-security attacks. Phishing is one such attack: an attacker masquerades as a legitimate website to obtain sensitive information from users. One real-world example is the wave of phishing emails sent during the Football World Cup, which promised tickets to Moscow in order to extract recipients' personal information. Beyond the danger it poses to users, phishing also tarnishes a company's brand image and breeds distrust in its application or service. With stakes this high, it is important to limit phishing attacks and to have a methodical approach for identifying them in order to build robust systems.
This problem can be addressed effectively with machine learning and data science techniques. Various features can be extracted from a website's URL to distinguish a legitimate website from a phishing one.
What it does
Detects phishing websites from their URLs and lifetime details (creation, expiration, and last update dates) using machine learning models.
How we built it
For our project, we extract a feature set consisting of the length of the URL, the number of slashes in the URL, the number of words in the host name, and the number of dots in the URL, along with lifetime parameters: the URL's creation date, expiration date, and last update date.
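To make the lexical features concrete, here is a minimal sketch of how they could be computed with Python's standard library. The function name `extract_url_features`, the dictionary keys, and the rule for counting host-name "words" (alphanumeric runs between dots and hyphens) are assumptions made for illustration, not necessarily the exact definitions used in the project. The date-based features are left out because they would require an external lookup (for example WHOIS).

```python
import re
from urllib.parse import urlparse


def extract_url_features(url):
    """Return the lexical URL features as a dict of integers."""
    # urlparse only recognizes the host when a scheme is present,
    # so add one for parsing without changing the string we measure.
    parse_target = url if "://" in url else "http://" + url
    host = urlparse(parse_target).hostname or ""

    # Assumption: a host-name "word" is any alphanumeric run
    # separated by dots, hyphens, or other punctuation.
    host_words = [w for w in re.split(r"[^A-Za-z0-9]+", host) if w]

    return {
        "url_length": len(url),
        "num_slashes": url.count("/"),
        "num_host_words": len(host_words),
        "num_dots": url.count("."),
    }


# Illustrative URL only; not taken from the project's dataset.
features = extract_url_features("http://secure-login.example.com/account/verify.php")
print(features)
```

Each URL becomes one row of numbers like this, which can then be combined with the date-based features before training.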
We split the features into two sets, one holding the numerical features and the other holding features extracted directly from the URL text, and applied a variety of machine learning models to each. The models include Logistic Regression, K-Nearest Neighbors, Decision Trees, Random Forest, Adaptive Boosting (AdaBoost), and Long Short-Term Memory (LSTM) networks. We measure correctness with the F1-score because it remains informative even when class sizes are imbalanced and it balances precision and recall. Instead of committing to a single learning model, we let our algorithm pick the best model for each feature set based on its F1-score. Finally, we combined the models built from the two feature sets, which resulted in a model with much better phishing detection capability than either one alone.
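The F1-based selection step can be sketched in a few lines of plain Python. The F1-score is the harmonic mean of precision and recall, F1 = 2 * P * R / (P + R). The helper names `f1_score` and `select_best_model`, and the toy labels and predictions below, are made up for illustration; in practice the predictions would come from trained models evaluated on a held-out validation set.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive (phishing) class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives means precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_best_model(predictions_by_model, y_true):
    """Return the name of the model with the highest F1, plus all scores."""
    scores = {
        name: f1_score(y_true, preds)
        for name, preds in predictions_by_model.items()
    }
    best_name = max(scores, key=scores.get)
    return best_name, scores


# Toy validation labels (1 = phishing, 0 = legitimate) and toy predictions.
y_val = [1, 0, 1, 1, 0, 0, 1, 0]
val_predictions = {
    "logistic_regression": [1, 0, 0, 1, 0, 1, 1, 0],
    "random_forest":       [1, 0, 1, 1, 0, 0, 1, 1],
    "knn":                 [0, 0, 1, 0, 0, 0, 1, 0],
}

best_name, f1_by_model = select_best_model(val_predictions, y_val)
print(best_name, f1_by_model)
```

Running the same selection once per feature set yields one chosen model for the numerical features and one for the URL-text features.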
Challenges we ran into
Feature engineering based on URLs, and combining multiple ML models trained on different features to solve the same problem.
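One simple way to combine two models trained on different feature sets is soft voting: take a weighted average of the phishing probabilities each model outputs and threshold the result. The sketch below assumes this scheme purely for illustration; the function names, the equal default weighting, the 0.5 threshold, and the sample probabilities are all hypothetical and may differ from the fusion we actually used.

```python
def fuse_probabilities(p_numeric, p_text, weight_numeric=0.5):
    """Weighted average of per-sample phishing probabilities from two models."""
    if not 0.0 <= weight_numeric <= 1.0:
        raise ValueError("weight_numeric must be between 0 and 1")
    if len(p_numeric) != len(p_text):
        raise ValueError("both models must score the same samples")
    weight_text = 1.0 - weight_numeric
    return [
        weight_numeric * a + weight_text * b
        for a, b in zip(p_numeric, p_text)
    ]


def classify(probabilities, threshold=0.5):
    """Label a sample as phishing (1) when its fused probability meets the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical probabilities from the numerical-feature model and the URL-text model.
p_numeric = [0.9, 0.4, 0.2]
p_text = [0.7, 0.8, 0.1]

fused = fuse_probabilities(p_numeric, p_text)
labels = classify(fused)
print(fused, labels)
```

In this toy case the second sample is flagged only because the URL-text model is confident, which is the kind of complementary signal that fusion is meant to capture.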
Accomplishments that we're proud of
Dynamically selecting the best model among many ML models, and fusing models trained on different feature sets to produce better results than the individual models.
What we learned
Solving a real-world data science problem, feature engineering, hyper-parameter tuning, and model fusion.
What's next for Fish the Phish
Engineer more relevant features and train on a much larger number of samples to improve the accuracy of our model.