YouTube Scam Comment Classification

Inspiration

The idea came to us a few days before the event. We were intrigued that YouTube still has no official solution to this problem. The project also touched on several subjects we are currently studying at university, so we decided to try building a solution ourselves.

What it does

It is a program that combines Machine Learning techniques with Natural Language Processing to scan the comment section of a YouTube video for scam comments. Specifically, it returns a list of potential scam accounts.

How we built it

We built this project mainly in Python, together with libraries such as sklearn, nltk, and pandas, plus the Google API to collect the data needed to train and test the models.

First, we wrote a Python script that fetches all the comments from a video and stores them in a .csv file. Next, we created a Jupyter notebook where we labeled most of the data and trained two linear regression models to classify it: one based on the comment text and the other on the author's username. Once the models were working, we moved on to the main app: we built a simple graphical interface and wrote the logic that uses the trained models to return a list of potential scam accounts for a YouTube video.
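The two-model idea can be sketched roughly as follows. This is a minimal illustration and not our actual notebook code. It rests on several assumptions: the toy rows are invented, the helper names (`build_model`, `flag_accounts`) are hypothetical, the features are plain TF-IDF, the sketch swaps in scikit-learn's `LogisticRegression` (a linear classifier that outputs probabilities) in place of the linear regression mentioned above, and the two scores are combined by simple averaging.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy rows invented for illustration: (username, comment text, 1 = scam, 0 = legitimate).
rows = [
    ("Text_Me_On_Telegram_123", "I made 5000 dollars in a week, message me on telegram", 1),
    ("WhatsApp_plus_1555", "Thanks for watching, contact me on whatsapp to claim your prize", 1),
    ("Crypto_Expert_Recovery", "This expert recovered my lost crypto, message him now", 1),
    ("maria_lopez", "Great video, the editing keeps getting better", 0),
    ("john_the_gamer", "I did not expect that ending at all", 0),
    ("cooking_with_sam", "Could you do a tutorial on the second part?", 0),
]
usernames = [r[0] for r in rows]
comments = [r[1] for r in rows]
labels = [r[2] for r in rows]


def build_model(analyzer, ngram_range):
    """TF-IDF features followed by a linear classifier (assumed setup)."""
    return make_pipeline(
        TfidfVectorizer(analyzer=analyzer, ngram_range=ngram_range),
        LogisticRegression(),
    )


# One model reads the comment text (word n-grams),
# the other reads the username (character n-grams).
comment_model = build_model("word", (1, 2)).fit(comments, labels)
username_model = build_model("char_wb", (2, 4)).fit(usernames, labels)


def flag_accounts(new_rows, threshold=0.5):
    """Return usernames whose averaged scam probability reaches the threshold.

    new_rows is a list of (username, comment text) pairs.
    Averaging the two probabilities is an assumption of this sketch.
    """
    names = [name for name, _ in new_rows]
    texts = [text for _, text in new_rows]
    p_comment = comment_model.predict_proba(texts)[:, 1]  # column 1 = scam class
    p_name = username_model.predict_proba(names)[:, 1]
    flagged = []
    for name, pc, pn in zip(names, p_comment, p_name):
        if (pc + pn) / 2 >= threshold and name not in flagged:
            flagged.append(name)
    return flagged
```

Calling `flag_accounts(pairs)` on the (username, comment) pairs scraped from a video would then give the list of suspicious accounts. With a training set this small the scores mean very little; the point is only the shape of the pipeline.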

Challenges we ran into

The first real challenge was building the dataset. Collecting the data was relatively easy, but it came unlabeled, so we had to classify it manually. To reduce that work while still getting a dataset reliable enough to train the models, we added the UserID to the table, which let us select all the comments made by a known scam user in one go. This way, we identified about 8,000 scam comments among the 30,000 comments in the training data. We also had some trouble with the text processing and with finding the right way to train two models, mostly because of library technicalities and our own misunderstanding of how dimensions were handled in parts of our code. Nonetheless, we got it right in the end.
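The UserID shortcut described above can be illustrated with a small sketch. The field names (`user_id`, `text`, `label`), the sample rows, and the function `propagate_labels` are all hypothetical, and the real table may have looked different. The idea is simply that once one comment from an account is confirmed as a scam, every other comment from that account gets the same label.

```python
def propagate_labels(comments, known_scam_user_ids):
    """Label every comment written by a known scam user as scam (1).

    Comments from other users keep whatever label they already had
    (None if they have not been reviewed yet).
    """
    labeled = []
    for comment in comments:
        is_known = comment["user_id"] in known_scam_user_ids
        label = 1 if is_known else comment.get("label")
        labeled.append({**comment, "label": label})
    return labeled


# Invented sample rows for illustration only.
comments = [
    {"user_id": "U1", "text": "message me on telegram"},
    {"user_id": "U2", "text": "great video"},
    {"user_id": "U1", "text": "I can double your savings"},
]
labeled = propagate_labels(comments, known_scam_user_ids={"U1"})
scam_count = sum(1 for c in labeled if c["label"] == 1)
```

Here both comments from `U1` are labeled in one pass, while the comment from `U2` still needs manual review.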

Accomplishments that we're proud of

We are really proud that the project seems to work well. It has some limitations due to the time constraints (for example, it probably does not work for scam comments in other languages), but it has far exceeded our expectations.

What we learned

Each of us had a different learning experience because of our different backgrounds: two of us had already done some machine learning and were familiar with Python, while the other two had not. Despite that, we can confidently say that we all now have a better understanding of these techniques and workflows.

What's next for YouTube Scam Comment Classification

The project is a few modifications away from being usable by anyone (the obstacle is the Google API key). So, most likely, after making those changes we will publish the project on GitHub so that anyone who wants to use it, or build a better version of it, can.