Comments Less Likely to be Spam
Comments More Likely to be Spam
Visualization of Spam Words
We were tired of YouTube comment sections flooded with advertisements for shady websites and personal channels, so we found a way for YouTube content creators and viewers to enjoy reading comments again.
What it does
Hands the spam hammer down on comments. Given a YouTube video link, our web app displays the video's first 100 comments, each with the probability that it is spam, as estimated by an RNN. You can also view the most frequent spam words in a word cloud visualization.
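As a rough sketch of how the word cloud input could be computed: count the words that appear in comments the model scores as likely spam. The 0.5 threshold, the stop-word list, and the function name are all assumptions made for illustration, not the values our app uses.

```python
import re
from collections import Counter

# Hypothetical threshold and stop-word list, chosen only for this example.
SPAM_THRESHOLD = 0.5
STOP_WORDS = {"the", "a", "an", "and", "to", "my", "of", "for", "on", "in", "is", "it"}


def spam_word_counts(scored_comments, threshold=SPAM_THRESHOLD):
    """Count words across comments whose spam probability meets the threshold.

    scored_comments: iterable of (comment_text, spam_probability) pairs.
    """
    counts = Counter()
    for text, probability in scored_comments:
        if probability < threshold:
            continue  # skip comments the model considers unlikely to be spam
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


# Made-up comments and scores, not taken from the dataset.
scored = [
    ("Check out my channel and subscribe", 0.97),
    ("Subscribe to my channel for free gift cards", 0.97),
    ("Great video, thanks for sharing", 0.04),
]
counts = spam_word_counts(scored)
top_words = counts.most_common(2)
```

The resulting word and count pairs are the kind of input a word cloud layout can scale font sizes from.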
How we built it
The RNN model is trained on the YouTube Spam Collection dataset from the UCI Machine Learning Repository. The model combines LSTM, bidirectional, and convolutional layers, among others. We used Stanford's GloVe word embedding matrix to boost the model's performance, and we trained the model on Google Compute Engine.
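To illustrate the embedding step, here is a minimal sketch of turning GloVe's plain-text format (a word followed by its vector components on each line) into a matrix indexed by a tokenizer's word ids. The helper names, the toy three-dimensional vectors, and the convention that row 0 is reserved for padding are assumptions for the example; the real GloVe files use 50 to 300 dimensions.

```python
import io


def load_glove(lines):
    """Parse GloVe text format: each line is a word followed by its vector components."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector for the word with index i.

    Row 0 is left as zeros (assumed padding index), and words that are
    missing from GloVe also keep an all-zero row.
    """
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        if word in vectors:
            matrix[i] = list(vectors[word])
    return matrix


# Toy stand-in for a GloVe file; the numbers are made up.
toy_glove = io.StringIO("free 0.1 0.2 0.3\nsubscribe 0.4 0.5 0.6\n")
glove_vectors = load_glove(toy_glove)

# Hypothetical word ids, as a tokenizer might assign them.
word_index = {"free": 1, "subscribe": 2, "sbam": 3}
embedding_matrix = build_embedding_matrix(word_index, glove_vectors, dim=3)
```

A matrix like this would typically be used as the initial weights of the network's embedding layer, so the model starts with word similarities instead of learning them from a small dataset.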
For the backend, we're using Flask on Google Compute Engine. The front end is built in React, the word cloud is generated with d3.js, and the comments are fetched through YouTube's API.
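A sketch of the first thing the backend has to do with a submitted link: pull out the video id and build a request for the comments. This assumes the YouTube Data API v3 `commentThreads` endpoint; the function names, the accepted hostnames, and the `YOUR_API_KEY` placeholder are illustrative, and the code only builds the URL without sending it.

```python
from urllib.parse import urlparse, parse_qs, urlencode

COMMENT_THREADS_ENDPOINT = "https://www.googleapis.com/youtube/v3/commentThreads"


def extract_video_id(link):
    """Return the video id from a watch or youtu.be link, or None if none is found."""
    parsed = urlparse(link)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the id in the path: https://youtu.be/<id>
        return parsed.path.lstrip("/") or None
    if host in ("youtube.com", "www.youtube.com", "m.youtube.com"):
        # Watch links carry the id in the "v" query parameter.
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    return None


def comment_threads_url(video_id, api_key, max_results=100):
    """Build a commentThreads request URL for up to max_results top-level comments."""
    query = urlencode({
        "part": "snippet",
        "videoId": video_id,
        "maxResults": max_results,
        "textFormat": "plainText",
        "key": api_key,
    })
    return f"{COMMENT_THREADS_ENDPOINT}?{query}"


video_id = extract_video_id("https://www.youtube.com/watch?v=abc123XYZ_-")
request_url = comment_threads_url(video_id, "YOUR_API_KEY")
```

The JSON response can then be fetched with any HTTP client, and each comment's text passed to the model for scoring.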
Challenges we ran into
TensorFlow was not available for Python 3.7 :'( The RNN was not converging well, so we decided to use Stanford's GloVe word embedding matrix to provide correlations among words. We also hit CORS errors: we couldn't request files over http (where the data was originally stored), so we had to switch to https.
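The http to https switch can be sketched as a small URL rewrite. This is only an illustration with a hypothetical helper name and a placeholder URL, and it assumes the host actually serves the same file over https.

```python
from urllib.parse import urlsplit, urlunsplit


def upgrade_to_https(url):
    """Return the URL with an http scheme replaced by https; other URLs are unchanged."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)


# Placeholder URL, not the location of our data.
secure_url = upgrade_to_https("http://example.com/data/spam.csv")
```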
Accomplishments that we're proud of
We were able to increase our accuracy from an 86% baseline to 96%. We were also able to install TensorFlow and host a server on Google Compute Engine.
What we learned
We learned some cool APIs and NLP along the way.
What's next for Detective Sbam
Users will be able to mark comments as spam, and the web app will automatically report those comments to YouTube as spam.