Inspiration/What it does

Ad-blocking and antivirus extensions rely on lists of URLs and hardcoded heuristics to block content. This approach is usually effective, but it does not generalize well and can be evaded by anyone who knows the ruleset. To address this, we developed a Chrome/Firefox browser extension that sends links to a cloud-hosted RNN trained to classify each one as malicious or safe. The extension then warns the user about dangerous URLs by updating the page's CSS.

How we built it

The SafetyNet browser extension is written in JavaScript and uses the WebExtensions API to communicate with a Python script that runs in the background. This script sends POST requests to a Bottle server running on AWS, which returns inferences made by a recurrent neural network that we pretrained on an open-source URL dataset found here. The browser extension uses these results to conditionally highlight malicious links.
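A minimal sketch of the background script's side of that exchange, using only the Python standard library. The endpoint address, the JSON request schema (`{"urls": [...]}`), the response schema (`{"predictions": [...]}`), and the function names are all assumptions made for illustration; the write-up does not specify the actual wire format.

```python
import json
from urllib import request

# Hypothetical endpoint; the real server address is not given in the write-up.
ENDPOINT = "https://example.invalid/classify"


def build_request(urls, endpoint=ENDPOINT):
    """Package a list of URLs as a JSON POST request (assumed schema)."""
    payload = json.dumps({"urls": list(urls)}).encode("utf-8")
    return request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def malicious_urls(response_body):
    """Return the set of URLs flagged as malicious in an assumed response format."""
    predictions = json.loads(response_body)["predictions"]
    return {p["url"] for p in predictions if p["malicious"]}


def classify(urls, endpoint=ENDPOINT, timeout=5):
    """Send the request and return the flagged URLs (requires a live server)."""
    with request.urlopen(build_request(urls, endpoint), timeout=timeout) as resp:
        return malicious_urls(resp.read().decode("utf-8"))
```

Keeping request construction and response parsing separate from the network call means both can be checked without a running server, which would have helped when debugging the link between the extension and the API.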

The recurrent neural network uses LSTM units and is written in PyTorch. Pandas DataFrames are used for data loading and preprocessing.
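An LSTM consumes sequences of integer ids (typically through an embedding layer), so each URL has to be turned into such a sequence during preprocessing. The write-up does not say how URLs were tokenized; the sketch below assumes character-level encoding over printable ASCII with a fixed maximum length, and the names `encode_url`, `VOCAB`, `PAD`, and `UNK` are hypothetical.

```python
import string

# Assumed vocabulary: printable ASCII characters.
# Id 0 is reserved for padding and id 1 for characters outside the vocabulary.
PAD, UNK = 0, 1
VOCAB = {ch: i + 2 for i, ch in enumerate(string.printable)}


def encode_url(url, max_len=128):
    """Map a URL to a fixed-length list of integer character ids.

    Longer URLs are truncated to max_len; shorter ones are right-padded with PAD.
    """
    ids = [VOCAB.get(ch, UNK) for ch in url[:max_len]]
    return ids + [PAD] * (max_len - len(ids))
```

Fixed-length output lets encoded URLs be stacked into a single batch tensor, and reserving an id for padding lets the embedding layer ignore the filler positions.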

Challenges/Accomplishments

The hardest part was getting the browser extension to finally sync up with the API. This was the first time any of us had created a browser extension, and using the WebExtensions API to communicate with our Python script was difficult. It was also our first time using a real-world dataset that had not been heavily curated beforehand, so we learned a lot about data preprocessing and the data loading side of PyTorch.

What we learned

The number one takeaway is that an ML model is only as good as the data fed into it. We reached higher than 95% accuracy on a held-out test set from our dataset, but problems with the data, such as repeated examples, class imbalance, and a lack of "normal-looking" URLs, lowered the model's accuracy on natural data. In the future, we would spend more time looking for a well-curated dataset.
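Two of these issues, repeated examples and class imbalance, can be measured before training. The sketch below is one way to do that, not the check we ran at the time; the function name `audit` and the `(url, label)` row format are assumptions for illustration.

```python
from collections import Counter


def audit(rows):
    """Drop repeated URLs and report class balance.

    rows: list of (url, label) pairs. The first occurrence of each URL is kept.
    Returns the de-duplicated rows and a small dictionary of statistics.
    """
    seen = set()
    unique = []
    for url, label in rows:
        if url not in seen:
            seen.add(url)
            unique.append((url, label))

    counts = Counter(label for _, label in unique)
    total = len(unique)
    stats = {
        "duplicates_removed": len(rows) - total,
        "class_share": {label: n / total for label, n in counts.items()},
    }
    return unique, stats
```

Removing duplicates before splitting into training and test sets matters because a URL that appears in both can inflate held-out accuracy, which may be one reason a high test score did not carry over to natural data.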
