Inspiration/What it does
Ad-blocking and antivirus extensions use lists of URLs and hardcoded heuristics to block content. This approach is usually effective, but it does not generalize well and can be fooled by anyone who knows the ruleset. To address this, we developed a Chrome/Firefox browser extension that sends links to a cloud-hosted RNN trained to classify each one as malicious or safe. The extension then warns the user about dangerous URLs by updating the page's CSS.
How we built it
The recurrent neural network uses LSTM units and is written in PyTorch. Pandas dataframes are used in data loading/preprocessing.
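To make the architecture concrete, here is a minimal sketch of a character-level LSTM classifier in PyTorch. It is an illustration only, not our exact model: the vocabulary (printable ASCII), the `MAX_LEN` cap, the layer sizes, and the names `encode_url` and `UrlClassifier` are all assumptions made for this example.

```python
import torch
import torch.nn as nn

# Assumed vocabulary: printable ASCII. Index 0 = padding, 1 = unknown character.
PAD, UNK = 0, 1
VOCAB = {chr(c): i + 2 for i, c in enumerate(range(32, 127))}
MAX_LEN = 200  # hypothetical cap on URL length


def encode_url(url: str, max_len: int = MAX_LEN) -> torch.Tensor:
    """Map a URL to a fixed-length tensor of character ids, left-padded."""
    ids = [VOCAB.get(ch, UNK) for ch in url[:max_len]]
    # Left-pad so the LSTM's final step always lands on a real character.
    ids = [PAD] * (max_len - len(ids)) + ids
    return torch.tensor(ids, dtype=torch.long)


class UrlClassifier(nn.Module):
    """Embedding -> LSTM -> linear head producing one logit per URL."""

    def __init__(self, vocab_size=len(VOCAB) + 2, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=PAD)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        embedded = self.embed(x)               # (batch, seq, embed_dim)
        _, (h_n, _) = self.lstm(embedded)      # h_n: (1, batch, hidden_dim)
        return self.head(h_n[-1]).squeeze(-1)  # (batch,) raw logits
```

Applying `torch.sigmoid` to the logits gives a score between 0 and 1 that can be thresholded into malicious or safe.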
The hardest part was getting the browser extension to sync up with the API. This was the first time any of us had created a browser extension, and using the WebExtensions API to communicate with our Python script was difficult. It was also our first time using a real-world dataset that had not been heavily curated beforehand, so we learned a lot about data preprocessing and the data loading side of PyTorch.
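For readers curious about the server side of that exchange, the sketch below shows one way a Python endpoint could accept a batch of links and return a verdict for each. It uses only the standard library and is not our actual service: the JSON shape (`{"urls": [...]}`), the port, the names `classify_payload`, `ClassifyHandler`, and `run`, and the `placeholder_predict` rule standing in for the trained model are all hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def placeholder_predict(url: str) -> float:
    """Stand-in for the trained model; returns a made-up score in [0, 1]."""
    return 0.9 if "login" in url and not url.startswith("https://") else 0.1


def classify_payload(body: bytes, predict=placeholder_predict, threshold=0.5) -> dict:
    """Parse a JSON body like {"urls": [...]} and label each URL."""
    payload = json.loads(body)
    urls = payload.get("urls", [])
    return {
        "results": [{"url": u, "malicious": predict(u) >= threshold} for u in urls]
    }


class ClassifyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            response, status = classify_payload(self.rfile.read(length)), 200
        except (ValueError, AttributeError):
            # Malformed JSON, or JSON that is not an object with a "urls" key.
            response, status = {"error": "expected JSON object with a 'urls' list"}, 400
        data = json.dumps(response).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        # Assumed permissive CORS header so an extension can read the response.
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def run(port: int = 8000) -> None:
    """Serve until interrupted; call run() manually to try it locally."""
    HTTPServer(("127.0.0.1", port), ClassifyHandler).serve_forever()
```

Keeping the parsing and scoring in `classify_payload`, separate from the HTTP handler, means that logic can be tested without starting a server.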
What we learned
The number one takeaway has been that an ML model is only as good as the data fed into it. We were able to get higher than 95% accuracy on a held-out test set from our dataset, but problems with the data, such as repeated examples, class imbalance, and a lack of "normal-looking" URLs, meant the model was less accurate on URLs encountered in ordinary browsing. In the future, we would spend more time looking for a well-curated dataset to use.
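Two of these issues, repeated examples and class imbalance, can be partly addressed in preprocessing. The sketch below shows one possible approach with pandas, not what we actually ran: the column names `url` and `label` (1 = malicious, 0 = safe) and the helper name `clean_and_weight` are assumptions for illustration.

```python
import pandas as pd


def clean_and_weight(df: pd.DataFrame, url_col: str = "url", label_col: str = "label"):
    """Drop repeated URLs and compute a weight for the positive class.

    Assumes label 1 = malicious and 0 = safe (hypothetical column layout).
    """
    # Remove duplicates BEFORE splitting into train/test sets, so the same
    # URL cannot appear on both sides and inflate held-out accuracy.
    deduped = df.drop_duplicates(subset=url_col).reset_index(drop=True)

    counts = deduped[label_col].value_counts()
    n_neg = int(counts.get(0, 0))
    n_pos = int(counts.get(1, 0))
    # Ratio of negatives to positives; falls back to 1.0 if a class is missing.
    pos_weight = n_neg / n_pos if n_pos and n_neg else 1.0
    return deduped, pos_weight


example = pd.DataFrame(
    {
        "url": ["a.test", "a.test", "b.test", "c.test", "d.test"],
        "label": [1, 1, 0, 0, 0],
    }
)
deduped, pos_weight = clean_and_weight(example)
```

The resulting weight could, for example, be wrapped in a tensor and passed as `pos_weight` to `torch.nn.BCEWithLogitsLoss` so that the rarer class counts for more during training. Neither step fixes the lack of "normal-looking" URLs, which needs better data rather than better preprocessing.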