HackHarassment

Datasets

We used two publicly available datasets:

Formspring Labeled for Cyberbullying
MySpace Group Data Labeled for Cyberbullying

What it does

A user signs up and then sends an SMS through the Twilio API. When the server receives the text, it classifies the message and forwards it to the intended recipient.

A D3 graph accompanies the hack: it visualises user messages and updates its colours (red/green) to show whether a person has committed harassment.

How it Works

We are using an SVM and an LLDA (Labelled Latent Dirichlet Allocation).

For the SVM we are using a Bag-of-Words model.
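As a rough illustration of the Bag-of-Words step, the sketch below turns messages into word-count vectors, which is the kind of input an SVM would be trained on. The tokenizer, the function names, and the toy messages are all assumptions made for this example; the project's actual preprocessing is not described here. A library such as scikit-learn offers `CountVectorizer` for the same purpose, though whether the project used it is not stated.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(messages):
    """Map every distinct token in the training messages to a column index."""
    vocab = sorted({tok for msg in messages for tok in tokenize(msg)})
    return {tok: i for i, tok in enumerate(vocab)}


def vectorize(message, vocab):
    """Return the word-count vector of a message; unseen words are ignored."""
    counts = Counter(tokenize(message))
    vec = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:
            vec[vocab[tok]] = n
    return vec


# Hypothetical toy messages, not taken from the Formspring or MySpace data.
train = ["you are great", "you are so so annoying"]
vocab = build_vocabulary(train)
vectors = [vectorize(m, vocab) for m in train]
```

Word order is discarded: each message becomes a fixed-length vector with one count per vocabulary word, and those vectors, paired with the harassment labels, are what the classifier sees.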

For the LLDA, we use Google's list of banned words as labels. When a new message arrives, we compute its topic distribution and classify it as harassment based on the sum of the topic distributions.
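The decision rule described above might look something like the following sketch. It assumes the LLDA model has already produced a topic distribution for the message, that there is at least one non-harassment ("background") topic alongside the banned-word labels, and that the summed probability is compared against a threshold of 0.5. The function names, the labels, the threshold, and the example distributions are all hypothetical; the source does not specify them.

```python
def harassment_score(topic_distribution, harassment_labels):
    """Sum the probability mass on topics whose label is a banned word."""
    return sum(
        prob
        for label, prob in topic_distribution.items()
        if label in harassment_labels
    )


def is_harassment(topic_distribution, harassment_labels, threshold=0.5):
    """Flag the message when the summed mass reaches the (assumed) threshold."""
    return harassment_score(topic_distribution, harassment_labels) >= threshold


# Hypothetical labels and topic distributions, for illustration only.
labels = {"idiot", "loser"}
benign = {"idiot": 0.05, "loser": 0.05, "background": 0.90}
abusive = {"idiot": 0.50, "loser": 0.25, "background": 0.25}

print(is_harassment(benign, labels))
print(is_harassment(abusive, labels))
```

Summing over labels, rather than looking only at the single most likely topic, lets a message that spreads its weight across several banned-word topics still be flagged.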

Challenges we ran into

Improving the accuracy of the model. We found that ensemble learning gave the best results after repeated testing with stratified 10-fold cross-validation.
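For readers unfamiliar with stratified k-fold cross-validation, here is a minimal pure-Python sketch of how the folds can be built so that each one keeps roughly the same share of harassment and non-harassment messages. The function names, the seed, and the toy labels are assumptions for illustration; scikit-learn's `StratifiedKFold` provides the same idea, but the project's actual evaluation code is not shown here.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign each sample index to one of k folds with similar class ratios."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin, continuing where the previous class stopped.
        for j, idx in enumerate(indices):
            folds[(offset + j) % k].append(idx)
        offset += len(indices)
    return folds


def train_test_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices), using each fold once as test set."""
    folds = stratified_folds(labels, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical toy labels: 30 harassment (1) and 70 non-harassment (0).
y = [1] * 30 + [0] * 70
splits = list(train_test_splits(y, k=10))
```

Each candidate model, including an ensemble, would be trained on the `train` indices and scored on the `test` indices of every split, and the scores then averaged across the ten splits.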

Accomplishments that we're proud of

F1 Score: 0.663871351995
Accuracy: 0.729411764706
Precision: 0.655128205128
Recall: 0.677898550725
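As a reminder of what these four numbers measure, the sketch below computes them for a binary classifier from true and predicted labels. The function name and the toy labels are hypothetical. Note that the harmonic mean of the reported precision and recall is about 0.666 rather than the reported F1 of about 0.664; one possible explanation, which is an assumption and not stated in the source, is that each metric was averaged separately across folds or classes.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels: 1 = harassment, 0 = not harassment.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```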

Built With

Submitted to

StudentHack V
- Winner #HackHarassment
- Winner Accenture Big Data challenge
- Winner Big Data track

Created by

Updates

Izz Abudaka started this project — Mar 12, 2017 04:48 AM EDT
