First, we scraped Twitter for tweets containing traces of cyberbullying. We tried the Twitter4J API, but it appeared to restrict the number of tweets that can be pulled from Twitter. So we built our own client, which searches tweets through the Twitter search API and parses the returned JSON content with jsoup. We also collected data from other sources, such as YouTube and Formspring. We converted all datasets into the same format, giving us around 22,000 tweets labelled as either cyberbullying or not. We then took equal numbers of positive and negative examples so that the classifier would not be biased toward the majority class, which left us with around 5,100 labelled tweets. We applied some pre-processing, such as removing hashtags and normalizing numbers. We extracted sentiment analysis tokens using Mood Indigo, trained models using Azure ML, and ran experiments.
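The pre-processing and class-balancing steps described above could look roughly like the following Python sketch. It is illustrative only and is not the project's actual code: the function names `preprocess` and `balance`, the `<num>` placeholder, the choice to drop the whole hashtag rather than only the `#` symbol, and the 1/0 label encoding are all assumptions.

```python
import random
import re

# Assumption: a hashtag is "#" followed by word characters, and the whole tag is dropped.
HASHTAG = re.compile(r"#\w+")
# Assumption: numbers (including forms like 1,000 or 3.5) collapse to one placeholder token.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
WHITESPACE = re.compile(r"\s+")


def preprocess(text):
    """Remove hashtags, normalize numbers, collapse whitespace, and lowercase."""
    text = HASHTAG.sub(" ", text)
    text = NUMBER.sub("<num>", text)
    return WHITESPACE.sub(" ", text).strip().lower()


def balance(examples, seed=0):
    """Downsample the larger class so both labels appear equally often.

    `examples` is a list of (text, label) pairs, with label 1 for
    cyberbullying and 0 otherwise (hypothetical encoding).
    """
    rng = random.Random(seed)
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    n = min(len(positives), len(negatives))
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)
    return balanced
```

Downsampling to the size of the smaller class is one simple way to get equal numbers of positive and negative examples. It discards data, which is consistent with the drop from around 22,000 to around 5,100 tweets.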
